Exercise: An OpenMP-parallel Jacobi solver
You will find an F90 and a C program in folder J3D. Both implement a Jacobi solver in 3D. The inner loop kernel (the "line update kernel") is in a separate subroutine. The program takes the linear problem size N of the grid on the command line (C) or on standard input (Fortran). Useful problem sizes range from N=10 (24 kB) to N=500 (3 GB). The benchmark reports performance in MLUP/s (million lattice-site updates per second) and chooses the number of iterations so that the runtime exceeds 0.5 seconds.
- The code is already OpenMP-parallel. Look at its performance for typical in-cache and out-of-cache situations:
a) in comparison to the sequential code
b) with respect to the number of threads (1..10, 11..20). For which problem sizes do you get a speedup when using eight threads instead of one?
c) How many threads do you need to saturate the memory bus if the data does not fit into any cache?
d) Does your code perform in accordance with a bandwidth-based performance model with 8 threads at N=150? Look closely into the code to determine the correct value for the computational intensity.
- Look at the innermost loop (it is in a separate subroutine). Knowing that a floating-point divide is a very slow operation on this processor, how can one speed up the calculation of a stencil update? What kind of change do you expect in terms of performance and scalability? Did the compiler already do "the right thing"?
- Repeat the measurements for your optimized code but compile with compiler option -O0 (no optimization). What about (i) speedup and (ii) performance when scaling from 1 to 10 threads on a socket?
- Optimize the code by spatial blocking so that it achieves the same performance at N=150 as at N=600 with 10 threads on one chip! Calculate the required block size before you do the actual implementation. Which loop in the nest is best to parallelize in this case?
- What do you have to do to make the code scale across NUMA domains? How does the blocking optimization above interfere with the NUMA placement?
- Instead of the blocking optimization you may also consider using a different loop schedule. Try the static,1 schedule for the parallel (outer) loop of the non-blocked code with N=600 and 8 threads (1 socket). What is going on here? What about proper NUMA behavior?
- Compile the main program with -S (and proper optimization flags again) to get assembly output. Locate the assembly code for the line update kernel and figure out whether the code was SIMD-vectorized for arithmetic and/or load/store operations!