Gabriele Codega, gcodega@sissa.it, AMMA
Code for the exam in Development tools for Scientific Computing, SISSA, a.y. 2024-2025.
The goal here was to implement matrix-matrix multiplication in distributed memory. Details about the implementation are a bit further down, but the general idea is to split the matrices we want to multiply between MPI processes and let each process compute a chunk of the result. The whole distributed multiplication requires a number of steps, such as computing the workload for each process, initialising the data, communicating and computing individual chunks, hence it is not straightforward to write a single distributed_multiply routine. In fact, in this code there is no such routine; the whole distributed machinery is provided in scripts/run.py.
src/matmul/routines.py contains a number of matrix-matrix multiplication routines (serial, parallel, tiled, GPU-accelerated) that can be used as the local kernel in the distributed algorithm. The performance of the distributed algorithm depends on the performance of this base routine.
All the code is implemented in Python. NumPy is used to manipulate the matrices, while Numba is used to JIT compile serial and parallel routines for both CPU and GPU. MPI support is provided by mpi4py.
NOTE: Installing mpi4py with pip requires a working MPI installation on the machine. Also, for GPU computing, a recent version of the CUDA Toolkit is required (see the Numba documentation for details).
Clone this repo locally and then run
python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt
python -m pip install .
Optionally install dependencies for testing (test), profiling (profile) or both (dev) with
python -m pip install .[<DEPENDENCY>]
As it turns out, NVHPC ships with everything that is needed here. One issue is that mpi4py is not really meant to be compiled with nvc by default. If you run into issues while installing, you may want to try
CFLAGS=-noswitcherror python -m pip install --no-cache-dir --no-binary=mpi4py mpi4py
You may also have to set the CUDA_HOME, NUMBAPRO_NVVM and NUMBAPRO_LIBDEVICE environment variables. For example
export CUDA_HOME=$NVHPC_ROOT/cuda/12.0
export NUMBAPRO_NVVM=$NVHPC_ROOT/cuda/12.0/nvvm/lib64
export NUMBAPRO_LIBDEVICE=$NVHPC_ROOT/cuda/12.0/nvvm/libdevice
You can get container images with this code from DockerHub as well. The images are built with CUDA 12.4 and still require NVIDIA drivers on the host machine to run. If you plan on running a Docker container you can get the image with
docker pull gcodega/matmul:cuda12.4-docker
If you plan on running a Singularity container, you can get a different tag
docker pull gcodega/matmul:cuda12.4-singularity
Note that to run Docker with CUDA support you may need the NVIDIA Container Toolkit, whereas Singularity natively supports CUDA.
The tags only differ in that the Docker image uses a custom user (tony:matmul), whereas the Singularity image runs as root. Not setting a custom user in the Singularity image is actually recommended: Singularity containers inherit the user from the host, and setting a user other than root may cause issues with environments inside the container. Also, if you want to run the code on an HPC facility you may want to use the Singularity image, as it can interact with the host MPI and run on multiple nodes. Note that in this case performance may not be optimal, as the OpenMPI inside the container is not optimized for any specific machine.
To check out how different routines perform, you can run scripts/run.py. After installing the package, you can modify examples/config.yaml by specifying the following parameters (a sample configuration is shown after the list):
- device: either 'cpu' or 'gpu', depending on which device you want to run the code on
- size: an integer that determines the size of the two matrices to be multiplied (assuming square matrices)
- function: requires two fields
  - routine: name of the routine you want to use, as implemented in src/matmul/routines.py
  - block_size: an integer that determines the size of the blocks in tiled matrix multiplication. For the non-tiled routines this is required for signature consistency but is not actually used. For the GPU routine it needs to be consistent with the BLOCK_SIZE variable in src/matmul/routines.py, and ideally with the grid dimension used to invoke the kernel.
- print: either True or False, whether to print the result
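For reference, a minimal examples/config.yaml could look like the following. The keys mirror the list above; the values and the routine name are purely illustrative.

device: 'gpu'                     # 'cpu' or 'gpu'
size: 4096                        # matrices are size x size
function:
  routine: 'matmul_gpu_shared'    # hypothetical name: pick one from src/matmul/routines.py
  block_size: 16                  # must match BLOCK_SIZE for the GPU routine
print: False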
You can run the script with
mpirun -n <ntasks> python scripts/run.py --config experiments/config
Note that if you want to run this in serial you still need to use mpirun -n 1 ...
. Moreover, should you run any of this code on an HPC facility and submit a SLURM job, also note that SLURM's srun
might not work with mpi4py, and you may need to use mpirun
instead (in shell/submit.sbatch
you can find the script that I used to submit jobs on Ulysses at SISSA). Finally, when running through Singularity you may need to specify absolute paths for the scripts (all source code is in /shared-folder/app
).
The script will print to screen the time spent multiplying the matrices (i.e. excluding communication time and other overheads). You can get more insight by profiling the code with kernprof. The script in shell/submit.sh lets you run one instance of kernprof for each MPI task and save the results to separate files. You can select the number of threads for the parallel routines by changing NUMBA_NUM_THREADS and customize the output path for kernprof. Run the script as
mpirun -n <ntasks> shell/submit.sh
The script also provides an example of how you could profile the code with valgrind to check the cache efficiency of different routines. Note that using valgrind with Python is really not optimal, since typically the interpreter makes a lot of function calls and it is not easy to identify which calls refer to specific lines in your code. If cache utilisation is substantially different between two routines you can still expect to see differences in the valgrind summary.
Suppose we want to multiply two matrices A and B and store the result in C.
The idea is to distribute the rows of A and B among the MPI processes, so that each process computes the block of rows of C corresponding to its rows of A. To do that, a process needs all the columns of B, which are scattered across the processes, hence some communication is required.
Communicating a single column of B at a time would require many small messages, so we communicate blocks of columns instead.
Assume that the workload is perfectly balanced, so that each process gets the same number of rows of A and B.
With this method, after as many communication steps as there are processes, each process has seen all the columns of B and has computed its whole block of rows of C.
In practice we can use MPI.Allgatherv for communications, so that each process gets what it needs from every other process (also accounting for possible small workload imbalance). To actually perform communications we need the send buffer to be a contiguous memory region, which is not the case here if the matrices are stored in row-major order (a block of columns is strided in memory), so before communicating we need to copy the local portion of the column block of B into a contiguous buffer.
The algorithm goes something like this
// share workload and initialise matrices
for i = 1...n_tasks
    create_block(B, B_block)                  // copy the local part of column block i into a contiguous buffer
    MPI.Allgatherv(B_block, ..., B_col, ...)  // assemble the full column block on every process
    matmul(A, B_col, C)                       // we need to make sure we write to the correct columns of C
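The following is a minimal, self-contained sketch of this loop with mpi4py and NumPy. It assumes square N x N matrices with N divisible by the number of tasks, and uses NumPy's @ in place of whichever local routine is selected; variable names are illustrative, and the actual logic lives in scripts/run.py.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, n_tasks = comm.Get_rank(), comm.Get_size()

N = 1024                                  # global matrix size (assumed divisible by n_tasks)
rows = N // n_tasks                       # rows of A, B and C owned by each process
rng = np.random.default_rng(rank)

A_loc = rng.random((rows, N))             # local rows of A
B_loc = rng.random((rows, N))             # local rows of B
C_loc = np.zeros((rows, N))               # local rows of the result

counts = [rows * rows] * n_tasks          # elements contributed by each process per column block
B_col = np.empty((N, rows))               # full column block, assembled from all processes

for i in range(n_tasks):
    # copy the local slice of column block i into a contiguous send buffer
    B_block = np.ascontiguousarray(B_loc[:, i * rows:(i + 1) * rows])
    comm.Allgatherv(B_block, (B_col, counts))
    # local product; it fills the columns of C belonging to block i
    C_loc[:, i * rows:(i + 1) * rows] = A_loc @ B_col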
In this process, matmul can be any routine that a single process can run locally, and indeed we shall see that the choice of routine can alter the performance significantly. It should also be pointed out that the routine does not need to run on the CPU at all. Indeed, we can use a GPU kernel to perform the local products blazing fast, at the cost of being extra careful in managing memory.
In particular, when using a GPU kernel we need to move the data from the host to the device before computing, and copy the result back to the host afterwards, so host-device transfers add to the total runtime on top of computation and communication.
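A minimal sketch of this host-device traffic with Numba CUDA follows; sizes and names are illustrative, and the kernel launch itself is shown in the GPU sections below.

import numpy as np
from numba import cuda

A = np.random.rand(1024, 1024)
B_col = np.random.rand(1024, 128)

d_A = cuda.to_device(A)                   # host -> device (paid once for the local rows of A)
d_B = cuda.to_device(B_col)               # host -> device (paid at every iteration)
d_C = cuda.device_array((1024, 128))      # allocated directly on the device

# ... launch the matmul kernel on d_A, d_B, d_C here ...

C_chunk = d_C.copy_to_host()              # device -> host (the cross-hatched time in the plots below)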
As already noted, the local multiplication routine can be whatever, so I went ahead and implemented a bunch of algorithms to see how they perform.
CPU routines share the signature void(double[:,:],double[:,:],double[:,:],int), where the arrays are A, B and C and the integer is the tile (block) size.
GPU routines have signature void(double[:,:],double[:,:],double[:,:]), where the arrays are device arrays; the tile size is gone since it needs to be known at compile time. These also need to be called as routine[gridSize, blockSize](A,B,C), where the arguments in square brackets determine the layout of GPU threads (this is how CUDA works, I don't make the rules here).
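To make the conventions concrete, here is a hedged sketch of how such routines are declared and called with Numba; float64 is Numba's spelling of double, and the exact decorators in src/matmul/routines.py may differ.

import numpy as np
from numba import njit, cuda

# CPU convention: three arrays plus a tile size (unused by non-tiled routines)
@njit("void(float64[:,:], float64[:,:], float64[:,:], int64)")
def cpu_routine(A, B, C, block_size):
    pass                                  # placeholder body; real routines are sketched below

# GPU convention: device arrays only, tile size baked in at compile time
@cuda.jit("void(float64[:,:], float64[:,:], float64[:,:])")
def gpu_routine(A, B, C):
    pass                                  # placeholder body

A = np.random.rand(256, 256); B = np.random.rand(256, 256); C = np.zeros((256, 256))
cpu_routine(A, B, C, 32)                                  # plain call on the CPU

d_A, d_B, d_C = cuda.to_device(A), cuda.to_device(B), cuda.to_device(C)
gpu_routine[(16, 16), (16, 16)](d_A, d_B, d_C)            # grid and block layout in brackets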
Matrix multiplication, at least as implemented here, is largely a memory-bound problem, meaning that the biggest bottleneck is not computation but memory access. In each routine we therefore try to explore how different memory access patterns affect the performance.
This is the standard triple for loop over i,j,k. We compute C[i,j] += A[i,k] * B[k,j], with k being the fastest index. Actually, instead of directly incrementing C[i,j], we use a local accumulator to be a bit more cache-friendly. This is already very slow for large matrices, so we at least use NumPy arrays (instead of lists).
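A minimal sketch of the naive routine with Numba, following the CPU signature above (details may differ from the actual implementation):

from numba import njit

@njit("void(float64[:,:], float64[:,:], float64[:,:], int64)")
def matmul_naive(A, B, C, block_size):    # block_size unused, kept for signature consistency
    n, m = A.shape
    p = B.shape[1]
    for i in range(n):
        for j in range(p):
            acc = 0.0                     # local accumulator instead of incrementing C[i, j]
            for k in range(m):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc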
It turns out that iterating through i,k,j is much more cache-friendly, since in the innermost loop we now sweep rows of B and C contiguously instead of jumping across columns. There are two JIT compiled implementations, serial and parallel (the outer loop over i is parallel).
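A hedged sketch of the parallel variant follows (the serial one simply uses range instead of prange):

from numba import njit, prange

@njit("void(float64[:,:], float64[:,:], float64[:,:], int64)", parallel=True)
def matmul_reorder(A, B, C, block_size):  # block_size unused; C is assumed zero-initialised
    n, m = A.shape
    p = B.shape[1]
    for i in prange(n):                   # rows of A (and C) are distributed over threads
        for k in range(m):
            a = A[i, k]                   # reused across a whole row of B
            for j in range(p):
                C[i, j] += a * B[k, j]    # contiguous access to B and C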
This should be the most efficient algorithm in terms of cache utilisation (emphasis on should). The idea is that we access the matrices in small blocks that can fit into cache, so as to reduce cache misses. This means that we basically use a small block of A and a small block of B to update a small block of C before moving on to the next blocks.
Difficulties here stem from the fact that choosing the best block size is not easy, as it strongly depends on the hardware (mainly on cache sizes). This, together with the fact that there are three more for loops and more complex boundary checking, can make this method slower than the previous one.
As before, there are two JIT compiled implementations, serial and parallel, neither of which I managed to make run faster than the loop reordering routines.
(I even tried plain C implementations with memory alignment and explicit AVX-512 instructions, but the improvement over loop reordering was not substantial at all.)
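A hedged sketch of the blocked (tiled) routine; the loop structure follows the description above, but block handling and boundary checks are illustrative and may differ from the repo's implementation.

from numba import njit

@njit("void(float64[:,:], float64[:,:], float64[:,:], int64)")
def matmul_blocked(A, B, C, block_size):  # C is assumed zero-initialised
    n, m = A.shape
    p = B.shape[1]
    for ii in range(0, n, block_size):
        for kk in range(0, m, block_size):
            for jj in range(0, p, block_size):
                # update the (ii, jj) block of C with the (ii, kk) block of A
                # and the (kk, jj) block of B, all small enough to sit in cache
                for i in range(ii, min(ii + block_size, n)):
                    for k in range(kk, min(kk + block_size, m)):
                        a = A[i, k]
                        for j in range(jj, min(jj + block_size, p)):
                            C[i, j] += a * B[k, j]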
GPUs have virtually infinite threads, in the sense that the overhead for thread creation and context switching is really small. This means that we can make each thread compute a single entry of C: for the thread at position i,j in the grid, we simply iterate over k, accumulate A[i,k] * B[k,j] and set the final value of C[i,j]. We just need to make sure that we have enough threads to fully cover C.
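A minimal sketch of such a kernel with Numba CUDA (names are illustrative):

from numba import cuda

@cuda.jit("void(float64[:,:], float64[:,:], float64[:,:])")
def matmul_gpu_naive(A, B, C):
    i, j = cuda.grid(2)                       # global thread indices in the 2D grid
    if i < C.shape[0] and j < C.shape[1]:     # guard against extra threads at the borders
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc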
The idea is pretty much the same as for the CPU version, only this time we have virtually unlimited threads. The whole point here is once again to improve cache efficiency, which can be done by using CUDA shared memory arrays. Indeed, threads in the same block can access a fast on-chip memory (sitting alongside the L1 cache) which is much faster than global memory. We can exploit this by creating shared arrays that fit in this memory and contain blocks of A and B.
The parallel capabilities of GPUs make this blocked algorithm much more useful and efficient than its CPU counterpart.
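A hedged sketch of the shared-memory (tiled) kernel, assuming square tiles of size BLOCK_SIZE, matrix dimensions that are multiples of BLOCK_SIZE, and a grid that exactly covers C; the actual kernel in src/matmul/routines.py may handle boundaries differently.

from numba import cuda, float64

BLOCK_SIZE = 16                               # must be known at compile time

@cuda.jit("void(float64[:,:], float64[:,:], float64[:,:])")
def matmul_gpu_shared(A, B, C):
    sA = cuda.shared.array((BLOCK_SIZE, BLOCK_SIZE), dtype=float64)
    sB = cuda.shared.array((BLOCK_SIZE, BLOCK_SIZE), dtype=float64)

    i, j = cuda.grid(2)                       # global row and column of C
    ti, tj = cuda.threadIdx.x, cuda.threadIdx.y

    acc = 0.0
    for t in range(A.shape[1] // BLOCK_SIZE):
        # each thread loads one element of the current tiles of A and B
        sA[ti, tj] = A[i, t * BLOCK_SIZE + tj]
        sB[ti, tj] = B[t * BLOCK_SIZE + ti, j]
        cuda.syncthreads()                    # wait until the tiles are fully loaded
        for k in range(BLOCK_SIZE):
            acc += sA[ti, k] * sB[k, tj]
        cuda.syncthreads()                    # wait before overwriting the tiles
    C[i, j] = acc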
Let's start by saying that the CPU routines are nowhere near as fast as NumPy's, which was expected, while the blocked GPU routine is apparently as fast as PyTorch's matmul, which was not expected.
A first interesting comparison is the runtime for different routines and increasing matrix size. In the figure we can see the comparison between the serial routine (loop reordering) on a single core, the same routine used with MPI in distributed memory, the parallel routine with loop reordering on a single MPI task, and the blocked GPU routine on a single GPU. The hardware for this test consists of an 11th Gen Intel Core i7-11700 @ 2.5GHz and a single NVIDIA Quadro RTX GPU.
The solid boxes represent actual computational time to compute the product, the parallel hatch represents communication time, and the cross hatch is time to copy memory from host to device and vice versa.
We can see that the GPU routine is the fastest, followed by the two parallel CPU routines, which are comparable, and finally the serial routine. Curiously, for the smallest size the MPI routine is slower than the serial one, possibly due to resource contention between processes. The shared-memory (multi-threaded) routine is slightly faster than the MPI version, which again is possibly due to resource contention.
In this case the communication time is very small since all the processes run on the same physical CPU, while host-device synchronization is already noticeable.
The execution time for the blocked CPU routines was typically twice as large as for the loop reordering routines, which is why they are omitted here. Remember that the whole point of using blocked routines is to improve cache utilisation, which should lead to faster execution times. Since we are not getting faster routines, we may want to check whether the cache misses are indeed fewer for the blocked routine, but unfortunately this is not straightforward to do in Python. Indeed, using a tool like valgrind to profile cache utilisation is not very informative here, since the Python interpreter makes lots of function calls and it is very hard to interpret the results. A possible workaround is to implement the same routines in C, which is much better suited to valgrind, to see if we can at least hope that the Python implementations will do what we expect.
The table shows the results obtained by profiling the naive, loop reordering and blocked routines implemented in C, compiled with gcc and the -O3 optimisation flag.
Routine | Dr | D1mr | DLmr | Dw | D1mw | DLmw |
---|---|---|---|---|---|---|
naive (accumulate) | 1,610,612,736 (42.2%) | 1,076,239,361 (79.5%) | 132,716 (14.8%) | 0 | 0 | 0 |
naive (set C[i,j]) | 0 | 0 | 0 | 1,048,576 (0.1%) | 1,048,576 (53.2%) | 114,895 (14.2%) |
loop reordering | 1,073,742,848 (28.1%) | 134,349,825 (9.9%) | 261,121 (29.1%) | 536,873,984 (48.4%) | 0 | 0 |
blocked | 1,073,992,704 (28.2%) | 142,525,434 (10.5%) | 239,472 (26.6%) | 536,971,264 (48.4%) | 0 | 0 |
The column headers follow valgrind's notation: 1 and L indicate the first and last level cache, r and w refer to data reads and writes respectively, and m stands for miss.
We can see that the naive version misses substantially more than the others, both in reading and in writing data, which is expected. In fact, the loop reordering and blocked routines do not miss at all when writing, and the difference between the two is very small. Things are not looking good for the blocked routine: it misses slightly more first level reads, but compensates by missing fewer last level ones. Is this enough for it to be the best routine? As it turns out, yes, but just marginally. What we do not see here is any information about the level two cache, which might be the one making the difference. Indeed, these results were obtained with a block size of 148, which is almost exactly right to fill the L2 cache. Does the same block size yield similar results for the Python code? Unfortunately, no. It is possible that the (small) speedup of the blocked C routine is mostly due to compiler optimisations, which may not be available to the JIT compiled Python code.
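As a back-of-the-envelope check, the working set of three double-precision 148 x 148 blocks (one each for A, B and C) is indeed close to a 512 KiB per-core L2 cache; the 512 KiB figure is an assumption about the test CPU, not a measurement.

block = 148
working_set = 3 * block * block * 8       # three blocks of 8-byte doubles, in bytes
print(working_set / 1024)                 # ~513 KiB, roughly the size of a 512 KiB L2 cache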
To avoid undesired trouble, we set aside the blocked CPU routine and test how the parallel loop reordering CPU routine and the blocked GPU routine scale for large matrices. The results reported here were obtained by running the code on Ulysses at SISSA. The nodes are equipped with 2 Intel Xeon E5-2683 v4 CPUs, with 16 cores each and hyperthreading enabled (for a total of 64 threads), and 2 NVIDIA Tesla P100 GPUs. In the experiments, each node was assigned 2 MPI tasks, each with 32 threads and 1 GPU (for the GPU routine only). A larger matrix size was used for the GPU runs than for the CPU ones, as noted below.
As before, solid bars are the multiplication time, parallel hatches are communications and cross hatches are host-device synchronisation. We notice that the GPU routine is much faster than the other, in spite of the matrix being four times the size. Moreover, we can see that the time for communication is larger for two or more nodes, which is both due to the larger number of MPI tasks and -- most importantly -- to the fact that these communications are internode and thus happen over the network, which is slower. Finally, we can see that the time for multiplication decreases with increasing number of nodes, which is what we expect. This is best appreciated by looking at the speedup in the following figure.
Here we can see that for both routines the speedup for multiplication is almost ideal. When also factoring in the communication and synchronisation times we get larger deviations from the ideal speedup, which also depend on the details of the machine.
To better understand how the various steps contribute to the total execution time for large matrices and many processes, we can look at the results obtained by profiling the code with kernprof and line_profiler.
We observe that, both for CPU and GPU, on a single node most of the time is spent in computation (t2.synchronize for the GPU charts, since kernel execution is asynchronous and we rely on CUDA events for timing), whereas on four nodes, since the computation time is much smaller, MPI overhead dominates. In particular, for the CPU most of the time is spent initialising MPI, which is not so bad as this is done only once. For the GPU, on the other hand, it is communication that takes the longest, which is potentially a serious performance issue. This big difference in communication times between the CPU and GPU implementations (which we can also see in the bar charts) is not expected, and it might be due to the specific network state at the time of the experiments. It is also interesting to note that the time to initialise the matrices reduces significantly with an increasing number of nodes, which is indeed what we expect since the total workload is fixed. Notice that for the GPU implementation, a substantial slice of the pie is occupied by memory transfers between host and device.
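For reference, the event-based timing mentioned above looks roughly like this with Numba CUDA; the kernel launch is illustrative.

from numba import cuda

t1, t2 = cuda.event(), cuda.event()

t1.record()
# matmul_gpu_shared[grid, block](d_A, d_B, d_C)  # asynchronous kernel launch
t2.record()
t2.synchronize()                                 # block until the kernel (and t2) have completed

elapsed_ms = cuda.event_elapsed_time(t1, t2)     # elapsed time in milliseconds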
Let's wrap up the discussion by stressing once again that this code was not meant to compete with BLAS or cuBLAS by any means, and that if performance is of paramount importance one is better off by just using some highly optimized compiled code. Nonetheless, we have seen that with Python one can get good performance with minimal effort, by exploiting tools like Numba for JIT compilation (on GPU even) and mpi4py for distributed computing. Moreover, tools like kernprof make it very easy to profile code, find bottlenecks and study scalability properties, so long as one is not interested in low-level performance indicators (such as cache utilisation).