A 3D stencil smoother

Perform some basic benchmarking for the plain 3D Jacobi solver. Example codes in Fortran and C can be found in the folder J3D. The inner loop has been delegated to a separate function; this makes it easier for the compiler to generate optimal code. For the initial tests you should make sure that the processor clock speed is set to its nominal value. Use likwid-pin to bind OpenMP threads to cores when necessary.
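
The code structure described above (a sweep whose inner loop lives in its own function) can be sketched as follows. This is a minimal serial illustration, not the code shipped in the J3D folder: the function names `jacobi_line` and `jacobi_sweep`, the flat array layout with `i` as the fastest index, and the fixed one-point boundary layer are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical inner-loop kernel: updates the interior points of one grid
 * line of length n. "center" is the source line itself, "north"/"south" are
 * its neighbours in j, "up"/"down" its neighbours in k. */
void jacobi_line(double *restrict dst, const double *restrict center,
                 const double *restrict north, const double *restrict south,
                 const double *restrict up, const double *restrict down,
                 size_t n)
{
    const double sixth = 1.0 / 6.0;
    for (size_t i = 1; i + 1 < n; ++i) {
        dst[i] = sixth * (center[i - 1] + center[i + 1] +
                          north[i] + south[i] + up[i] + down[i]);
    }
}

/* Hypothetical driver: one Jacobi sweep over an n x n x n grid stored as a
 * flat array with index (k*n + j)*n + i. Boundary points are not written. */
void jacobi_sweep(double *dst, const double *src, size_t n)
{
    const size_t plane = n * n;
    for (size_t k = 1; k + 1 < n; ++k) {
        for (size_t j = 1; j + 1 < n; ++j) {
            const size_t line = (k * n + j) * n;
            jacobi_line(dst + line, src + line,
                        src + line + n, src + line - n,
                        src + line + plane, src + line - plane, n);
        }
    }
}
```

A full solver would call `jacobi_sweep` repeatedly and swap the `src` and `dst` pointers after each sweep. Keeping the innermost loop in a small function with `restrict`-qualified pointers is one way to tell the compiler that the target line does not overlap the source lines.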

  1. Parallelize the 3D Jacobi solver using OpenMP and perform a scalability study for 1,...,n, and 2n threads (n being the number of physical cores on a socket) at a problem size of 200x200x200. Does your code scale? What is the achieved memory bandwidth per socket? 
  2. Measure the full-socket performance of the solver in MLUP/s (million lattice site updates per second) for a grid of size NxNxN, with N between 10 and 800. What do you observe for very large and very small problem sizes? Can you interpret the results?
  3. Use an OpenMP schedule of "static,1" for parallelization of the outer loop at a problem size of 300x300x300 and compare with "static". Can you interpret the result?
  4. Compile the parallel code with -O0 (no optimization). What is the impact on performance and on scalability within a socket (1..10 threads) at a problem size of 200x200x200?

  5. Use likwid-perfctr to measure the code balance of the stencil sweep in B/LUP at N=200 and N=800. Does it agree with your model?

  6. Using likwid-perfctr, measure the energy (in Joules) your code "burns" per LUP. Does this number depend on the problem size? Can you optimize the energy consumption at N=800 without changing the code? How about spatial blocking?
  7. How does the energy per LUP depend on the number of cores?
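
The metrics asked for in the tasks above (MLUP/s, B/LUP, and energy per LUP) all follow from a few measured quantities: the number of lattice site updates, the runtime, the memory data volume, and the energy. The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, and counting only the (N-2)^3 interior points per sweep is an assumption that may differ from how the example codes count updates.

```c
#include <stddef.h>

/* Number of lattice site updates for "sweeps" sweeps over an n^3 grid,
 * assuming (for this sketch) that only the (n-2)^3 interior points are
 * updated. Returned as double to avoid integer overflow for long runs. */
double lattice_updates(size_t n, size_t sweeps)
{
    const double inner = (double)(n - 2);
    return inner * inner * inner * (double)sweeps;
}

/* Performance in MLUP/s from the update count and the runtime in seconds. */
double mlups(double lups, double seconds)
{
    return lups / (seconds * 1.0e6);
}

/* Code balance in B/LUP from a measured data volume in bytes. */
double bytes_per_lup(double bytes, double lups)
{
    return bytes / lups;
}

/* Energy per update in J/LUP from a measured energy in joules. */
double joules_per_lup(double joules, double lups)
{
    return joules / lups;
}
```

The data volume and energy arguments would come from the likwid-perfctr measurements; the runtime should cover only the sweeps themselves, not initialization.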

Last modified: Friday, 31 March 2017, 9:57 AM