## Exercise 4: a 3D stencil

Perform some basic benchmarking for the plain 3D Jacobi solver. Example codes in Fortran and C can be found in the folder J3D. You can start the binaries with a single number as a command line argument, which determines the size of the problem in one direction (the domain is always cubic):

    $ ./J3D_F.exe 200

The inner loop has been delegated to a separate function; this makes it easier for the compiler to generate optimal code.
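To illustrate why a separate inner-loop function can help, here is a minimal sketch of what such a kernel might look like in C. The actual code in the J3D folder is not reproduced here, so the names `jacobi_line` and `jacobi_sweep`, the 7-point averaging stencil, and the row-major layout with x as the fastest index are all assumptions. The point is that the function sees only a unit-stride loop over `restrict`-qualified pointers, which gives the compiler the no-alias information it needs for vectorization.

```c
#include <stddef.h>

/* Hypothetical line kernel: updates the interior points of one row in x.
   c is the centre row; jm/jp and km/kp are the neighbouring rows in y and z. */
static void jacobi_line(double *restrict dst,
                        const double *restrict c,
                        const double *restrict jm,
                        const double *restrict jp,
                        const double *restrict km,
                        const double *restrict kp,
                        int n)
{
    const double sixth = 1.0 / 6.0;
    for (int i = 1; i < n - 1; ++i)
        dst[i] = sixth * (c[i - 1] + c[i + 1] + jm[i] + jp[i] + km[i] + kp[i]);
}

/* One sweep over the interior of an n x n x n grid (assumed layout:
   index = (k * n + j) * n + i). Boundary sites are left untouched. */
void jacobi_sweep(double *restrict dst, const double *restrict src, int n)
{
    const size_t row = (size_t)n;
    const size_t plane = (size_t)n * (size_t)n;

    for (int k = 1; k < n - 1; ++k) {
        for (int j = 1; j < n - 1; ++j) {
            const size_t o = (size_t)k * plane + (size_t)j * row;
            jacobi_line(dst + o, src + o,
                        src + o - row, src + o + row,
                        src + o - plane, src + o + plane, n);
        }
    }
}
```

After each sweep the roles of `src` and `dst` would be swapped by the caller, as is usual for a Jacobi iteration.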

1. Measure the performance of the solver in MLUP/s (million lattice site updates per second) for grids of size NxNxN, with N between 10 and 800.

2. Parallelize the 3D Jacobi solver using OpenMP and perform a scalability study with 1, ..., 10 threads (1 socket) and with 20 threads (2 sockets) for the above problem sizes. Does your code scale? What is the achieved memory bandwidth per socket at N=250? Hint: You do not need a tool to get this number; measuring the runtime is sufficient. (The cache size is 25 MB per socket.)

3. If your program does not scale from 1 to 2 sockets, make it scale!

4. What do you observe for very large problem sizes (e.g., N=600)? What is the achieved memory bandwidth for 10 cores (1 socket)? What is the problem here, and what can be done about it?

5. Instrument the code with LIKWID markers to corroborate the hypothesis about layer conditions and code balance (24 B/LUP if the layer condition is fulfilled, 40 B/LUP if it is violated).

6. Compile the parallel code with -O0 (no optimization). Within a socket (1..6 threads), what is the impact on

   1. performance?
   2. scalability?
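
The quantities asked for in the tasks above can be derived from the measured runtime alone. The following is a minimal sketch under these assumptions: double precision data, only the (N-2)^3 interior sites counted as updates, a code balance that you supply yourself (for example 24 or 40 B/LUP), and the common rule of thumb that three N x N layers per thread should fit into half of the cache for the layer condition to hold. All function names are hypothetical helpers, not part of the J3D code.

```c
#include <stddef.h>

/* Performance in MLUP/s for `sweeps` sweeps over an n^3 grid that took
   `seconds` of wall-clock time. Assumes only interior sites are updated;
   replace (n - 2) by n if your code counts boundary sites as well. */
double mlups(size_t n, size_t sweeps, double seconds)
{
    const double m = (double)(n - 2);
    return m * m * m * (double)sweeps / seconds / 1.0e6;
}

/* Memory bandwidth in GB/s implied by a performance in MLUP/s and an
   assumed code balance in bytes per lattice site update. */
double bandwidth_gbs(double perf_mlups, double bytes_per_lup)
{
    return perf_mlups * bytes_per_lup / 1.0e3;
}

/* Bytes needed to hold three n x n layers of doubles for each thread
   (assumes every thread works on its own set of layers). */
double layer_bytes(size_t n, int threads)
{
    return 3.0 * (double)n * (double)n * (double)sizeof(double)
               * (double)threads;
}

/* Rule-of-thumb layer condition: returns 1 if the layers of all threads
   fit into half of a cache of `cache_bytes` bytes, 0 otherwise. */
int layer_condition_ok(size_t n, int threads, double cache_bytes)
{
    return layer_bytes(n, threads) < 0.5 * cache_bytes;
}
```

For instance, `mlups(202, 100, 2.0)` evaluates to 400 MLUP/s, and `bandwidth_gbs(400.0, 24.0)` then gives 9.6 GB/s. Comparing the bandwidth obtained under each assumed code balance with what the socket can plausibly deliver is one way to judge which balance applies at a given N.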

Last modified: Wednesday, 22 January 2020, 9:56 AM