A simple 3D Jacobi solver

Run some basic benchmarks of the plain 3D Jacobi solver. Example code in Fortran and C can be found in the folder J3D. The binaries take a single number as a command-line argument, which sets the problem size in each direction (the domain is always cubic):

$ ./J3D_F.exe 200

The inner loop has been delegated to a separate function; this makes it easier for the compiler to generate efficient code.
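The structure described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in J3D: the names `jacobi_line`, `wall_seconds`, and `jacobi_mlups`, the boundary values, and the use of `clock_gettime` for timing are all assumptions made for this example. The inner loop over one grid line lives in its own function, and the driver times a number of sweeps and converts the result to million lattice site updates per second.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime (assumes a POSIX system) */
#include <stdlib.h>
#include <time.h>

/* Hypothetical line kernel: update the interior points of one grid line of
   length n. c is the source line; jm/jp and km/kp are the neighbouring
   source lines in the j and k directions. */
void jacobi_line(double *restrict dst, const double *restrict c,
                 const double *restrict jm, const double *restrict jp,
                 const double *restrict km, const double *restrict kp, int n)
{
    const double sixth = 1.0 / 6.0;
    for (int i = 1; i < n - 1; ++i)
        dst[i] = sixth * (c[i - 1] + c[i + 1] + jm[i] + jp[i] + km[i] + kp[i]);
}

/* Wall-clock time in seconds. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Run `sweeps` Jacobi sweeps on an n x n x n grid and return the performance
   in MLUP/s. Returns a negative value for n < 3 or if allocation fails.
   If `center` is non-NULL it receives the final value at the grid centre,
   which also keeps the compiler from discarding the computation. */
double jacobi_mlups(int n, int sweeps, double *center)
{
    if (n < 3) return -1.0;
    size_t nn = (size_t)n * n, total = nn * n;
    double *a = malloc(total * sizeof *a);
    double *b = malloc(total * sizeof *b);
    if (!a || !b) { free(a); free(b); return -1.0; }

    /* Assumed setup: boundary fixed at 1, interior starts at 0. */
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                int edge = (i == 0 || j == 0 || k == 0 ||
                            i == n - 1 || j == n - 1 || k == n - 1);
                size_t idx = ((size_t)k * n + j) * n + i;
                a[idx] = b[idx] = edge ? 1.0 : 0.0;
            }

    double t0 = wall_seconds();
    for (int s = 0; s < sweeps; ++s) {
        for (int k = 1; k < n - 1; ++k)
            for (int j = 1; j < n - 1; ++j) {
                size_t row = ((size_t)k * n + j) * n;
                jacobi_line(b + row, a + row, a + row - n, a + row + n,
                            a + row - nn, a + row + nn, n);
            }
        double *tmp = a; a = b; b = tmp; /* swap source and target grids */
    }
    double elapsed = wall_seconds() - t0;

    if (center) *center = a[((size_t)(n / 2) * n + n / 2) * n + n / 2];
    free(a);
    free(b);

    /* Only interior points are updated: (n-2)^3 per sweep. */
    double lups = (double)sweeps * (n - 2) * (n - 2) * (n - 2);
    return elapsed > 0.0 ? lups / elapsed * 1.0e-6 : 0.0;
}
```

A call such as `jacobi_mlups(200, 10, NULL)` would then report the rate for a 200x200x200 grid. For small grids, more sweeps are needed to get a runtime long enough to measure reliably.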

  1. Measure the performance of the solver in MLUP/s (million lattice site updates per second) for a grid of size NxNxN, with N between 10 and 800.

  2. Parallelize the 3D Jacobi solver using OpenMP and perform a scalability study with 1 to 10 threads (1 socket) and with 20 threads (2 sockets) for the above problem sizes. Does your code scale? What is the achieved memory bandwidth per socket at N=250? Hint: You do not need a tool to get this number; measuring the runtime is sufficient. (The cache size is 25 MB per socket on Emmy.)

  3. If your program does not scale from 1 to 2 sockets, make it scale!

  4. What do you observe for very large problem sizes (e.g., N=600)? What is the achieved memory bandwidth for 10 cores (1 socket)? What is the problem here, and what can be done about it?

  5. The J3D folder contains a subfolder called markers, which holds a C version of the program with LIKWID markers and a makefile that works out of the box on Emmy with likwid 4.3.2. Use it to validate the code balance values of 24 B/LUP and 40 B/LUP for the 3D Jacobi algorithm.

  6. Compile the parallel code with -O0 (no optimization). Within a socket (1..6 threads), what is the impact on

    1. performance?
    2. scalability?
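For the bandwidth questions in the tasks above, the arithmetic that turns a measured update rate into a memory bandwidth can be sketched as below. The helper names `bandwidth_gbs` and `working_set_mb` are hypothetical, and the sketch assumes double-precision grids, two full arrays (source and target), and decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes). Which code balance applies in a given situation is left to the measurement; the function simply takes it as a parameter.

```c
/* Memory bandwidth in GB/s implied by a measured performance in MLUP/s,
   for an assumed code balance in bytes per lattice site update (B/LUP). */
double bandwidth_gbs(double mlups, double bytes_per_lup)
{
    return mlups * 1.0e6 * bytes_per_lup * 1.0e-9;
}

/* Size in MB of two double-precision n x n x n grids
   (assumption: 8 bytes per grid point, source plus target array). */
double working_set_mb(int n)
{
    return 2.0 * 8.0 * (double)n * (double)n * (double)n / 1.0e6;
}
```

With these assumptions, `working_set_mb(250)` comes out at about 250 MB, ten times the 25 MB cache of one socket, so the data for N=250 has to stream from main memory. A purely illustrative rate of 1000 MLUP/s at 24 B/LUP would correspond to about 24 GB/s.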

Last modified: Friday, 9 November 2018, 3:55 PM