When programming custom algorithms in python, it can happen that our code becomes slow because we run a couple of nested for-loops. If the inner loops do not depend on each other, code can be parallelized and sped up. Note, we are parallelizing code on a central processing unit (CPU) not not mix it up with GPU-acceleration that uses graphics processing units (GPUs).
import time import numpy as np from functools import partial import timeit import matplotlib.pyplot as plt import platform
We start with an algorithm that does something with an image at given pixel coordinates
def slow_thing(image, x, y): # Silly algorithm for wasting compute time sum = 0 for i in range(1000): for j in range(100): sum = sum + x sum = sum + y image[x, y] = sum
image = np.zeros((10, 10))
We now use timeit to measure how long the operation takes for processing a single pixel.
%timeit slow_thing(image, 4, 5)
3.67 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We now define the operation on the whole image
def process_image(image): for x in range(image.shape): for y in range(image.shape): slow_thing(image, x, y)
We measure the time of this function as mentioned above
372 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This function is quite slow and parallelization may make sense.
Parallelization using joblib.Parallel#
A simple and straightforward approach for parallelization is using
from joblib import Parallel, delayed, cpu_count
Note the reverse writing of the for loops in the following block. The term
delayed(slow_thing)(image, x, y) is technially a function call, that is not executed. Later, when the return value of this call is actually needed, then the actualy execution will happen. See dask delayed for details.
def process_image_parallel(image): Parallel(n_jobs=-1)(delayed(slow_thing)(image, x, y) for y in range(image.shape) for x in range(image.shape))
59.1 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A speedup of 7 is not bad. The
n_jobs=-1 implies that all compute units / threads are used. We can also print out how many compute cores were used:
For documentation purposes, we can also print out on what kind of CPU that algorithm was executed. This string might be more or less informative depending on what operating system / computer we are executing this notebook.
Benchmarking execution time#
In image processing, it is very common that execution time of algorithms shows different patterns depending on image size. We will now benchmark the algorithm above and see how it performs on differently sized images. To bind a function to benchmark to a given image without executing it, we are using the partial pattern.
def benchmark(target_function): """ Tests a function on a couple of image sizes and returns times taken for processing. """ sizes = np.arange(1, 5) * 10 benchmark_data =  for size in sizes: print("Size", size) # make new data image = np.zeros((size, size)) # bind target function to given image partial_function = partial(target_function, image) # measure execution time time_in_s = timeit.timeit(partial_function, number=10) print("time", time_in_s, "s") # store results benchmark_data.append([size, time_in_s]) return np.asarray(benchmark_data)
print("Benchmarking normal") benchmark_data_normal = benchmark(process_image) print("Benchmarking parallel") benchmark_data_parallel = benchmark(process_image_parallel)
Benchmarking normal Size 10 time 3.6755600420000007 s Size 20 time 14.606198167 s Size 30 time 32.68187025 s Size 40 time 58.304140666 s Benchmarking parallel Size 10 time 0.6165027079999987 s Size 20 time 2.0231766670000013 s Size 30 time 4.416658041999995 s Size 40 time 7.917167458000009 s
plt.scatter(benchmark_data_normal[:,0] ** 2, benchmark_data_normal[:,1]) plt.scatter(benchmark_data_parallel[:,0] ** 2, benchmark_data_parallel[:,1]) plt.legend(["normal", "parallel"]) plt.xlabel("Image size in pixels") plt.ylabel("Compute time in s") plt.show()
Text(0, 0.5, 'Compute time in s')
If we see this pattern, we speak of linear relationship between data size and compute time. Computer scientists use the O notation to describe the complexity of algorithms. This algorithm has
n represents the number of pixels in this case.
Let’s take a look at another algorithm.
def silly_sum(image): # Silly algorithm for wasting compute time sum = 0 for i in range(image.shape): for j in range(image.shape): for k in range(image.shape): for l in range(image.shape): sum = sum + image[i,j] - k + l sum = sum + i image[i, j] = sum / image.shape / image.shape
benchmark_data_silly_sum = benchmark(silly_sum)
Size 10 time 0.049324583000014854 s Size 20 time 0.7592800000000182 s Size 30 time 3.7161972079999828 s Size 40 time 11.734440750000005 s
plt.scatter(benchmark_data_silly_sum[:,0] ** 2, benchmark_data_silly_sum[:,1]) plt.legend(["normal"]) plt.xlabel("Image size in pixels") plt.ylabel("Compute time in s") plt.show()
Text(0, 0.5, 'Compute time in s')
This algorithm is stronger dependent on image size, the plot shows approximately quadratic complexity. That means if the data size douples, the compute time multiplies by four. The algorithms O-notation is
O(n^2). We could presume that a similar algorithm applied in 3D has cubic complexity,
O(n^3). If such algorithms are bottlenecks in your science, parallelization and GPU-acceleration make a lot of sense.
Code optimization using numba#
In case the code we perform is simple and just uses standard python, numpy etc. function, we can use a just-in-time (JIT) compiler, e.g. provided by numba to speedup the code.
from numba import jit @jit def process_image_compiled(image): for x in range(image.shape): for y in range(image.shape): # Silly algorithm for wasting compute time sum = 0 for i in range(1000): for j in range(1000): sum = sum + x sum = sum + y image[x, y] = sum
310 ns ± 0.827 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Note that in this section we only measured compute time of algorithms. We did not determine if the differently optimized versions of the algorithms produce the same result. Quality assurance is good scientific practice. The same is relevant in the context of GPU-acceleration and for example described in detail here.
To practice parallelization, parallelize the function
silly_sum shown above and plot its performance in comparison with the non-parallel version.