## MPI+OpenMP: jacobi - hybrid through OpenMP parallelization

#### Prepare for these Exercises:

cp -a ~a2c06aa/jacobi .           #   copy the jacobi directory

cd jacobi/C-hyb-jacobi            #   change into your C directory .OR. …

cd jacobi/F-hyb-jacobi            #   … into your Fortran directory

#### Contents:

1.  BASIC EXERCISE → Everybody should finish the basic exercise before lunch!

2.  INTERLUDE: ROOFLINE MODEL AND LIGHT-SPEED PERFORMANCE → Will be explained!

3.  ADVANCED EXERCISE → If your exercise group is really fast, you might try it before lunch!

4.  EXERCISE ON OVERLAPPING COMMUNICATION AND COMPUTATION → Afternoon!

♦  This is a 2D Jacobi solver (5-point stencil) with a 1D domain decomposition and halo exchange.

♦  The given code is MPI-only.

♦  You can build it with:  make  (take a look at the Makefile)

♦  Edit the job-submission file:  vi  job.sh

♦  Run it with:  sbatch  job.sh

♦  Solutions are provided in the directory:  solution/

#### 1.  BASIC EXERCISE (see step-by-step below):

• Parallelize the code with OpenMP to get a hybrid MPI+OpenMP code.
• Run it effectively on the given hardware.
• Learn how to take control of affinity with MPI and especially with MPI+OpenMP.
##### NOTES:

• The code is strongly memory bound at the problem size set in the input file.
• Always run multiple times and observe run-to-run performance variations.
• If you know how, try to calculate the maximum possible performance (ROOFLINE).

##### STEP-BY-STEP:

→  Run the MPI-only code with 1, 2, 3, 4, ... processes (in the course you may use up to 4 nodes),
and observe the achieved performance behavior.

→  Learn how to take control of affinity with MPI.
Can you get even higher performance with MPI-only by using fewer cores?

→  Parallelize the appropriate loops with OpenMP (see Documentation links below).
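A minimal sketch of what the parallelized sweep could look like. The array layout, the loop bounds, and the constant f (standing in for F(j, i)) are simplified assumptions for illustration, not the actual jacobi.c code; note the reduction clause on the residual, which is the classic pitfall in this loop:

```c
#define N 8
#define U(j, i)    u[(j) * (N + 2) + (i)]
#define UOLD(j, i) uold[(j) * (N + 2) + (i)]

/* Hypothetical stand-in for the stencil loop in jacobi.c: parallelize the
 * outer loop and reduce the residual across threads. */
double sweep(double *u, double *uold, double ax, double ay, double b,
             double f, double relax) {
    double residual = 0.0;
    #pragma omp parallel for reduction(+ : residual)
    for (int j = 1; j <= N; j++) {
        for (int i = 1; i <= N; i++) {
            double fLRes = (ax * (UOLD(j, i - 1) + UOLD(j, i + 1))
                          + ay * (UOLD(j - 1, i) + UOLD(j + 1, i))
                          + b * UOLD(j, i) - f) / b;
            U(j, i) = UOLD(j, i) - relax * fLRes;
            residual += fLRes * fLRes;
        }
    }
    return residual;
}
```

Without a reduction clause (or atomics), the `residual +=` update is a data race and the hybrid code silently computes a wrong residual.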

→  Run with OpenMP and only 1 MPI process ("OpenMP-only") on 1,2,3,4,...,all cores of 1 node, and compare to the MPI-only run.

→  Run hybrid variants with different MPI vs. OpenMP ratios.
What's the best performance you can get using all cores on 4 nodes?

!!!  Does the OpenMP/hybrid code perform as well as the MPI-only code?
→  If it doesn't, fix it!

#### 2.  INTERLUDE: ROOFLINE MODEL AND LIGHT-SPEED PERFORMANCE

##### ROOFLINE:

Performance  [GFlop/s]   vs.   Computational/Arithmetic Intensity  [Flop/Byte]

Two hardware limitations: either PEAK Flop/s or PEAK memory bandwidth

Performance P = min( P_max, b_mem / B_c )

##### Maximum Performance of a COMPUTE BOUND CODE = P_max

The theoretical value can be computed from the specification
(a more realistic value could be obtained by running Linpack):

Theoretical P_max = [1|2]  (number of sockets)
                  ×  2.6   (core frequency in GHz)
                  ×  8     (number of cores per socket)
                  ×  8     (single-precision operands per SIMD register)
                  ×  1     (number of ports for floating-point operations)

P_max (1 socket)            = 166.4 GFlop/s

P_max (2 sockets = 1 node)  = 332.8 GFlop/s

##### Maximum possible Performance of a  MEMORY BOUND CODE = P_bw = b_mem / B_c

Memory bandwidth b_mem:

The theoretical value can be computed from the specification:

Theoretical b_max = [1|2]  (number of sockets)
                  × (...)  (memory frequency in GT/s)
                  ×  8     (bytes per channel)
                  ×  4     (number of memory channels)

b_max(1.866) = 59.7 GB/s   \
b_max(1.600) = 51.2 GB/s    |   depends on what kind of memory is in the nodes
b_max(1.333) = 42.6 GB/s   /

STREAM benchmark provides a more realistic value:

STREAM benchmark = Sustainable Memory Bandwidth in High Performance Computers

see, e.g., ../stream/

compile with: icc -Ofast -fno-alias -qopenmp -ffreestanding -qopt-streaming-stores never stream.c

make sure to properly bind threads to 1 socket (or 2 sockets = 1 node)

run without NT stores (-qopt-streaming-stores never) but with correction factors
(×3/2 for Copy/Scale, ×4/3 for Add/Triad): STREAM does not count the write-allocate
transfer triggered by each store, so the actual traffic is 3 words instead of the
counted 2 (Copy/Scale) and 4 words instead of 3 (Add/Triad).

STREAM benchmark (average over 10 runs, without NT stores, ivyMUC — 1 socket):

    Function          GB/s
    Copy    × 3/2  =  47.7
    Scale   × 3/2  =  46.7
    Add     × 4/3  =  46.8
    Triad   × 4/3  =  47.6

==>   b_mem ≈ 47 GB/s   <==

##### P_bw = b_mem / B_c = Memory bandwidth / Code balance

Code balance = B_c = Data traffic / Floating point operations  [Byte/Flop]

( Machine balance = B_m = b_max / P_max )

Computational/Arithmetic Intensity = 1 / B_c  [Flop/Byte]

==> let's count per LUP (lattice update):

FLOATING POINT OPERATIONS  |   jacobi.c: simply count all + - * /
#  fLRes = ( ax * (UOLD(j, i-1) + UOLD(j, i+1))
#            + ay * (UOLD(j-1, i) + UOLD(j+1, i))
#            +  b * UOLD(j, i) - F(j, i)) / b;
#   U(j,i) = UOLD(j,i) - data->fRelax * fLRes;
#   residual += fLRes * fLRes;
FLOATING POINT OPERATIONS = 13 Flop

DATA TRAFFIC  |  jacobi.c, stencil loop:
#  fLRes = ( ax * (UOLD(j, i-1) + UOLD(j, i+1))
#            + ay * (UOLD(j-1, i) + UOLD(j+1, i))
#            +  b * UOLD(j, i) - F(j, i)) / b;
#   U(j,i) = UOLD(j,i) - data->fRelax * fLRes;
   4 x 8 B = 32 B   ( read UOLD(j+1, i) [the other UOLD accesses hit the cache],
                      read F(j, i), read-before-write + write U(j, i) )

jacobi.c, copy loop:
#  UOLD(j, i) = U(j, i);
 + 3 x 8 B = 24 B   ( read U(j, i), read-before-write + write UOLD(j, i) )

DATA TRAFFIC = 56 Byte

##### P_bw = 19.1 GFlop/s  on 1 socket of ivyMUC

##### ==> This is the light-speed performance of jacobi on 1 socket of ivyMUC !

#### 3.  ADVANCED EXERCISE:

Think about how you could improve the performance of the code...

Have a look at how the halo communication is done.
==> the halo communication is overlapped with the inner update (U_old = U)

Why is non-blocking MPI communication used here?

Does it make sense to overlap halo communication with inner update?

If not, do you still need the non-blocking versions of the MPI routines? Why or why not?
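For reference, the overlap pattern these questions refer to could look roughly like this. This is a hedged sketch, not the actual jacobi source: up, down, comm, n, jmax and the U/UOLD macros are assumed to be set up elsewhere, and an MPI environment is required to compile and run it.

```c
/* Sketch: exchange the halo rows of U while the copy UOLD = U runs. */
MPI_Request req[4];

/* start the non-blocking halo exchange of the outermost computed rows */
MPI_Irecv(&U(0,        0), n, MPI_DOUBLE, up,   0, comm, &req[0]);
MPI_Irecv(&U(jmax + 1, 0), n, MPI_DOUBLE, down, 1, comm, &req[1]);
MPI_Isend(&U(1,        0), n, MPI_DOUBLE, up,   1, comm, &req[2]);
MPI_Isend(&U(jmax,     0), n, MPI_DOUBLE, down, 0, comm, &req[3]);

/* overlap: copy the rows that do not take part in the exchange */
for (int j = 1; j <= jmax; j++)
    for (int i = 0; i < n; i++)
        UOLD(j, i) = U(j, i);

/* the halo rows must have arrived before the next sweep reads them */
MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
for (int i = 0; i < n; i++) {
    UOLD(0, i)        = U(0, i);
    UOLD(jmax + 1, i) = U(jmax + 1, i);
}
```

With blocking MPI_Send/MPI_Recv this overlap is impossible: each call would have to complete before the copy loop could start.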

Is there a way to increase the computational intensity of the code (reduce data traffic)?

#### 4.  EXERCISE ON OVERLAPPING COMMUNICATION AND COMPUTATION:

If you would like to use a reduction within a taskloop (new since OpenMP 5.0),
you have to switch modules, both on the login node and within the job script:

module switch intel/17.0 intel/18.0

module switch mpi.intel/2017 mpi.intel/2018

Finally, let's overlap communication and computation...

1. substitute the omp for with a taskloop:

parallel{ single{ taskloop for{<compute stencil and update>}}}

--> you may need a larger problem size to work on (see the input file)

--> the grainsize clause might help...
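The nesting sketched in step 1 could look like this in C. It is an assumed stand-in, not the jacobi source: the loop body here is a plain copy standing in for <compute stencil and update>, and the grainsize value 8 is arbitrary:

```c
#define N 64
#define U(j, i)    u[(j) * (N + 2) + (i)]
#define UOLD(j, i) uold[(j) * (N + 2) + (i)]

/* One thread (single) creates the tasks; the whole team (parallel)
 * executes them.  grainsize(8) packs 8 outer iterations per task. */
void update(double *u, double *uold) {
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(8)
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++)
            UOLD(j, i) = U(j, i);   /* stands in for <compute stencil and update> */
}
```

A smaller grainsize gives more tasks (better load balance, more overhead); a larger one gives fewer, coarser tasks, so tune it together with the problem size.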

2. overlapping communication and computation: