Exercise: A simple stencil solver
Perform some basic benchmarking for the plain 3D Jacobi solver. Example codes in Fortran and C can be found in the folder J3D. Start a binary with a single number as its command-line argument; this number sets the problem size in each direction (the domain is always cubic):
$ ./J3D_F.exe 200
The inner loop has been delegated to a separate function; this makes it easier for the compiler to generate optimal code.
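As an illustration of this structure, here is a minimal C sketch, not the code shipped in J3D. It assumes a standard 6-neighbor Jacobi stencil in double precision that updates interior points only. All names (`jacobi_line`, `jacobi_sweep`, `mlups`, `bandwidth_gbs`) are hypothetical. The two small helpers show how a measured runtime turns into MLUP/s, and how MLUP/s combined with an assumed code balance in B/LUP turns into a memory bandwidth estimate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inner-loop kernel: updates one line along the contiguous
 * x direction. c is the center line of the source grid; n/s are the
 * neighboring lines in y, u/d the neighboring lines in z.
 * Assumption: plain 6-neighbor average. */
static void jacobi_line(double *restrict dst, const double *restrict c,
                        const double *restrict n, const double *restrict s,
                        const double *restrict u, const double *restrict d,
                        int nx)
{
    const double sixth = 1.0 / 6.0;
    for (int i = 1; i < nx - 1; ++i)
        dst[i] = sixth * (c[i - 1] + c[i + 1] + n[i] + s[i] + u[i] + d[i]);
}

/* One sweep over the interior of an n x n x n grid stored contiguously,
 * with x as the fastest-running index. Boundary points are not touched. */
void jacobi_sweep(double *restrict dst, const double *restrict src, int n)
{
    assert(n >= 3);
    const size_t sy = (size_t)n;             /* stride between lines  */
    const size_t sz = (size_t)n * (size_t)n; /* stride between planes */

    for (int k = 1; k < n - 1; ++k) {
        for (int j = 1; j < n - 1; ++j) {
            const size_t o = (size_t)k * sz + (size_t)j * sy;
            jacobi_line(dst + o, src + o,
                        src + o - sy, src + o + sy,
                        src + o - sz, src + o + sz, n);
        }
    }
}

/* Performance in MLUP/s. Assumption: only interior points count as
 * lattice site updates. */
double mlups(int n, int sweeps, double seconds)
{
    const double m = (double)(n - 2);
    return m * m * m * (double)sweeps / seconds * 1e-6;
}

/* Memory bandwidth estimate in GB/s: MLUP/s times B/LUP gives MB/s. */
double bandwidth_gbs(double mlup_per_s, double bytes_per_lup)
{
    return mlup_per_s * bytes_per_lup * 1e-3;
}
```

To use the sketch, time a number of calls to `jacobi_sweep` (swapping the two grid pointers after each sweep) with a wall-clock timer such as `omp_get_wtime()`, pass the elapsed time to `mlups`, and multiply the result by the code balance that applies to your problem size.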
- Measure the performance of the solver in MLUP/s (million lattice site updates per second) for a grid of size NxNxN, with N between 10 and 800.
- Parallelize the 3D Jacobi solver using OpenMP and perform a scalability study with 1, ..., 10 threads (1 socket) and with 20 threads (2 sockets) for the above problem sizes. Does your code scale? What is the achieved memory bandwidth per socket at N=250? Hint: You do not need a tool to get this number; measuring the runtime is sufficient. (The cache size is 25 MB per socket.)
- If your program does not scale from 1 to 2 sockets, make it scale!
- What do you observe for very large problem sizes (e.g., N=600)? What is the achieved memory bandwidth for 10 cores (1 socket)? What is the problem here, and what can be done about it?
- The J3D folder contains a subfolder called markers, which holds a C version of the program with LIKWID markers and a makefile that you have to adapt to the local cluster's software environment. Use it to validate the 24 and 40 B/LUP code balance for the 3D Jacobi algorithm.
- Compile the parallel code with -O0 (no optimization). Within a socket (1..6 threads), what is the impact on