Exercise 4: a 3D stencil
Perform some basic benchmarking for the plain 3D Jacobi solver. Example codes in Fortran and C can be found in the folder J3D. You can start the binaries with a single number as a command line argument, which determines the size of the problem in one direction (the domain is always cubic):
$ ./J3D_F.exe 200
The inner loop has been delegated to a separate function; this makes it easier for the compiler to generate optimal code.
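To illustrate what "delegating the inner loop" can look like, here is a minimal C sketch of a Jacobi sweep together with a MLUPs/s measurement. It is not the code shipped in J3D. The names jacobi_line, jacobi_sweep, wall_seconds and jacobi_mlups are hypothetical, and the sketch assumes a 7-point stencil in double precision, a grid of N points per direction including the boundary layer, and that only the (N-2)^3 interior points count as updates.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical inner-loop function: updates the interior points of one
   contiguous line. All pointers refer to the start of a line (i = 0). */
void jacobi_line(double *restrict dst, const double *restrict center,
                 const double *restrict north, const double *restrict south,
                 const double *restrict top, const double *restrict bottom,
                 int n)
{
    const double sixth = 1.0 / 6.0;
    for (int i = 1; i < n - 1; ++i)
        dst[i] = (center[i - 1] + center[i + 1] + north[i] + south[i]
                  + top[i] + bottom[i]) * sixth;
}

/* One sweep over all interior lines of an n*n*n grid (boundary untouched). */
void jacobi_sweep(double *dst, const double *src, int n)
{
    const size_t line = (size_t)n;
    const size_t plane = (size_t)n * (size_t)n;
    for (int k = 1; k < n - 1; ++k) {
        for (int j = 1; j < n - 1; ++j) {
            const size_t o = (size_t)k * plane + (size_t)j * line;
            jacobi_line(dst + o, src + o, src + o - line, src + o + line,
                        src + o - plane, src + o + plane, n);
        }
    }
}

/* Wall-clock time in seconds (POSIX monotonic clock). */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Runs `sweeps` Jacobi sweeps on an n^3 grid and returns the performance
   in million lattice site updates per second; returns -1.0 if the
   allocation fails. Assumption: only interior points count as updates. */
double jacobi_mlups(int n, int sweeps)
{
    assert(n >= 3 && sweeps > 0);
    const size_t total = (size_t)n * (size_t)n * (size_t)n;
    double *a = malloc(total * sizeof *a);
    double *b = malloc(total * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }
    for (size_t i = 0; i < total; ++i) {
        a[i] = 1.0;
        b[i] = 1.0;
    }

    const double start = wall_seconds();
    for (int s = 0; s < sweeps; ++s) {
        jacobi_sweep(b, a, n);
        double *tmp = a;   /* swap source and destination grids */
        a = b;
        b = tmp;
    }
    const double elapsed = wall_seconds() - start;

    /* Read one result so the compiler cannot discard the sweeps. */
    volatile double sink = a[total / 2];
    (void)sink;

    const double interior = (double)(n - 2) * (double)(n - 2) * (double)(n - 2);
    free(a);
    free(b);
    return (double)sweeps * interior / elapsed * 1e-6;
}
```

Calling jacobi_mlups(200, 10) from a main function would correspond roughly to the run shown above. For very small N the runtime of a few sweeps can fall below the timer resolution, so the number of sweeps should be chosen large enough to give a measurable runtime.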
- Measure performance of the solver in MLUPs/s (million lattice site updates per second) for a grid of size NxNxN, with N between 10 and 800.
- Parallelize the 3D Jacobi solver using OpenMP and perform a scalability study with 1, ..., 10 threads (1 socket) and with 20 threads (2 sockets) for the above problem sizes. Does your code scale? What is the achieved memory bandwidth per socket at N=250? Hint: You do not need a tool to get this number; measuring the runtime is sufficient. (The cache size is 25 MB per socket.)
- If your program does not scale from 1 to 2 sockets, make it scale!
- What do you observe for very large problem sizes (e.g., N=600)? What is the achieved memory bandwidth for 10 cores (1 socket)? What is the problem here, and what can be done about it?
- Instrument the code with LIKWID markers to corroborate the hypothesis about layer conditions and code balance (24 B/LUP if the layer condition is fulfilled and 40 B/LUP if it is violated).
- Compile the parallel code with -O0 (no optimization). Within a socket (1..6 threads), what is the impact on