Multiprocessing, python

What is the difference between multithreading and multiprocessing?
?

Multithreading vs multiprocessing

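Multiprocessing runs work in several processes in parallel, typically spread across multiple CPU cores; each process has its own memory and can run one or more threads. Multithreading runs several threads concurrently within a single process, sharing that process's memory. Python multithreading does not work well for CPU-bound tasks, because the global interpreter lock (GIL) in standard CPython builds lets only one thread execute Python bytecode at a time, but it works well for I/O-bound tasks, where threads spend most of their time waiting.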
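
As a rough illustration of the I/O-bound case, the sketch below uses a thread pool to overlap several simulated waits. The function fake_download and its time.sleep call are hypothetical stand-ins for a real network or disk request; they are assumptions for this example and not part of the notes above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    # Hypothetical stand-in for an I/O call: sleeping releases the GIL,
    # much like waiting on a socket or a file would.
    time.sleep(delay)
    return delay


def download_all(delays, max_workers=4):
    # Run the simulated downloads in threads and time the whole batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fake_download, delays))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = download_all([0.2, 0.2, 0.2, 0.2])
    print("Results:", results)
    print(f"Elapsed: {elapsed:.2f}s (one after another this would take about 0.8s)")
```

Because the threads wait at the same time, the batch should finish in roughly the time of the longest single wait. A CPU-bound function would not see this benefit from threads.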

What are embarrassingly parallel functions?
?
Functions where little or no effort needs to be put in to parallelise them. This includes:

Implementation of Pool

Please give an example of how you would parallelise the following function:

def compute_square(num):
    return num * num

# List of numbers to compute squares for
numbers = [1, 2, 3, 4, 5]

squares = []
for number in numbers:
    squares.append(compute_square(number))

print("Input Numbers: ", numbers)
print("Squared Numbers: ", squares)

?

Implementation

Parallelised code:

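If the final result can be assembled from partial results computed on separate parts of the input, the problem is called embarrassingly parallel.
Notice that we added the if __name__ == "__main__": statement. Without it, under the spawn start method (the default on Windows and macOS), each child process would import the file again, run the module-level code, and try to spawn processes of its own. Python detects this and raises a RuntimeError rather than spawning endlessly. Yes, multiprocessing is weird.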
from multiprocessing import Pool

# Function to compute the square of a number
def compute_square(num):
    return num * num

# List of numbers to compute squares for
numbers = [1, 2, 3, 4, 5]

def run(numbers):
    # Create a Pool object with 4 processes
    with Pool(processes=4) as pool:
        # Map the function to the list of numbers
        results = pool.map(compute_square, numbers)

    print("Input Numbers: ", numbers)
    print("Squared Numbers: ", results)

if __name__ == "__main__":
    run(numbers)

In most cases you then only need to combine the partial results in some way.
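
A minimal sketch of that combine step, assuming the goal is a sum of squares: the input is split into chunks, each worker returns a partial sum, and the parent adds the partial sums together. The helper names split_into_chunks, partial_sum_of_squares and parallel_sum_of_squares are made up for this illustration.

```python
from multiprocessing import Pool


def split_into_chunks(items, chunk_size):
    # Cut the input list into consecutive slices of at most chunk_size items.
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def partial_sum_of_squares(chunk):
    # Work done by one process: a partial result for one part of the input.
    return sum(num * num for num in chunk)


def parallel_sum_of_squares(numbers, processes=4, chunk_size=2):
    chunks = split_into_chunks(numbers, chunk_size)
    with Pool(processes=processes) as pool:
        partial_sums = pool.map(partial_sum_of_squares, chunks)
    # Combine step: the partial sums add up to the full result.
    return sum(partial_sums)


if __name__ == "__main__":
    print("Sum of squares:", parallel_sum_of_squares([1, 2, 3, 4, 5]))
```

For an input this small the overhead of starting processes outweighs the work; the pattern only pays off when each chunk is expensive to compute.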

Variables created within a process are not shared with other processes. Each process has its own memory, which is released once the process has finished executing. This is why we need special objects from the multiprocessing library, such as Pool or Queue, to pass data and results between processes.
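
The sketch below illustrates this, assuming a module-level list named collected (a hypothetical name): the child process appends to its own copy of the list, so the parent only sees the value that was sent back through a multiprocessing Queue.

```python
from multiprocessing import Process, Queue

collected = []  # Module-level list; every process works on its own copy.


def worker(queue):
    # This append only changes the child's copy of the list.
    collected.append("set in child")
    # A Queue is one way to send data back to the parent process.
    queue.put("sent through queue")


def run_demo():
    queue = Queue()
    child = Process(target=worker, args=(queue,))
    child.start()
    message = queue.get()  # Blocks until the child has put a value.
    child.join()
    return list(collected), message


if __name__ == "__main__":
    parent_view, message = run_demo()
    print("List as seen by the parent:", parent_view)
    print("Received from child:", message)
```

The parent's list stays empty even though the child appended to it, while the message put on the queue does arrive.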

How do you decide how many processes you need or can spawn?
?

How many processes to spawn?

  1. Simple problems with limited impact: Just hardcode some number, don't overthink it.
  2. CPU bound: Use as many processes as there are CPU cores:
import os

# Number of logical CPU cores; os.cpu_count() may return None if it cannot be determined
num_cores = os.cpu_count()
  3. I/O bound: You can go above the number of cores. If it is heavily I/O bound, consider using 2 to 3 times the number of CPU cores.
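
These rules of thumb could be wrapped in a small helper. The function name suggest_process_count, the workload labels "cpu" and "io", and the default multiplier of 2 are assumptions made for this sketch, not fixed recommendations.

```python
import os


def suggest_process_count(workload, io_multiplier=2, cores=None):
    # os.cpu_count() can return None, so fall back to a single core.
    if cores is None:
        cores = os.cpu_count() or 1
    if workload == "cpu":
        # CPU bound: one process per core.
        return cores
    if workload == "io":
        # I/O bound: processes spend time waiting, so oversubscribe the cores.
        return cores * io_multiplier
    raise ValueError(f"unknown workload type: {workload!r}")


if __name__ == "__main__":
    print("CPU bound:", suggest_process_count("cpu"))
    print("I/O bound:", suggest_process_count("io", io_multiplier=3))
```

The returned value can be passed as the processes argument of Pool. Measuring with a few different values on the real workload is still the most reliable way to pick a number.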