MPI+OpenMP: jacobi - hybrid programming through OpenMP parallelization


Prepare for these Exercises:

cd ~/HY-VSC/jacobi/C-hyb-jacobi            #   change into your C directory .OR. …

cd ~/HY-VSC/jacobi/F-hyb-jacobi             #   …  .OR. into your Fortran directory




Contents:

   1.  BASIC EXERCISE → Everybody should finish the basic exercise!

   2.  INTERLUDE: ROOFLINE MODEL AND LIGHT-SPEED PERFORMANCE → Will be explained!

   3.  ADVANCED EXERCISE → If your exercise group is really fast, you might try it.

   4.  EXERCISE ON OVERLAPPING COMMUNICATION AND COMPUTATION → Tomorrow!




Jacobi Exercises: (→ see also slides #-#)


   ♦  This is a 2D Jacobi solver (5 point stencil) with a 1D domain decomposition and halo exchange.

   ♦  The given code is MPI-only.

   ♦  You can build it with:  make  (take a look at the Makefile)

   ♦  Edit the job-submission file:  vi  job.sh

   ♦  Run it with:  sbatch  job.sh


   ♦  Solutions are provided in the directory:  solution/   
      (Fortran solutions are provided only for the basic exercise; for the other exercises see the C versions)




1.  BASIC EXERCISE (see step-by-step below):


  • Parallelize the code with OpenMP to get a hybrid MPI+OpenMP code.
  • Run it effectively on the given hardware.
  • Learn how to take control of affinity with MPI and especially with MPI+OpenMP.

NOTES:

  • The code is strongly memory bound at the problem size set in the input file.
  • Always run multiple times and observe run-to-run performance variations.
  • If you know how, try to calculate the maximum possible performance (ROOFLINE).


STEP-BY-STEP:

  →  Run the MPI-only code with 1, 2, 3, 4, ... processes (in the course you may use up to 4 nodes),
       and observe the achieved performance behavior.

  →  Learn how to take control of affinity with MPI.

  →  Parallelize the appropriate loops with OpenMP (see the Documentation links below; a sketch follows after this list). 

  →  Run with OpenMP and only 1 MPI process ("OpenMP-only") on 1,2,3,4,...,all cores of 1 node,
       and compare to the MPI-only run.

  →  Run hybrid variants with different MPI vs. OpenMP ratios.
       What's the best performance you can get with using all cores on 4 nodes?

  !!!  Does the OpenMP/hybrid code perform as well as the MPI-only code?
       →  If it doesn't, fix it!
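
  A minimal sketch of what the OpenMP parallelization of the sweep could look like is shown below.
  The macros and variable names mirror the jacobi.c snippet quoted in the Roofline interlude; the
  function signature, loop bounds and row-major array layout are assumptions and have to be matched
  to the actual code.

     /* Sketch only (not the course solution): one OpenMP-parallel Jacobi sweep.  */
     /* Names follow the snippet quoted in the Roofline interlude; the signature, */
     /* jmax/imax and the array layout are assumptions.                           */
     #define U(j, i)    u[(j) * imax + (i)]
     #define UOLD(j, i) uold[(j) * imax + (i)]
     #define F(j, i)    f[(j) * imax + (i)]

     double jacobi_sweep(int jmax, int imax, double ax, double ay, double b,
                         double relax, double *u, const double *uold, const double *f)
     {
         double residual = 0.0;

         /* Work-sharing over the outer loop; residual is a reduction variable,
          * fLRes is private because it is declared inside the loop. */
         #pragma omp parallel for reduction(+ : residual)
         for (int j = 1; j < jmax - 1; j++) {
             for (int i = 1; i < imax - 1; i++) {
                 double fLRes = ( ax * (UOLD(j, i-1) + UOLD(j, i+1))
                                + ay * (UOLD(j-1, i) + UOLD(j+1, i))
                                +  b * UOLD(j, i) - F(j, i)) / b;
                 U(j, i) = UOLD(j, i) - relax * fLRes;
                 residual += fLRes * fLRes;
             }
         }
         return residual;
     }

  Do not forget that the copy loop (UOLD = U) has to be parallelized as well, with the same loop
  distribution, and remember the pinning controls (e.g. OMP_PLACES / OMP_PROC_BIND and the binding
  options of the MPI launcher) when you run the hybrid version.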


RECAP - you might want to look into the slides or documentation:
  →  Memory placement - First touch! (→ see also slides #-#; a sketch follows below)
  →  PINNING = end of the previous exercises ( → see also slides #-#)

 → Documentation:         OpenMP         LIKWID         MPI
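
 On the first-touch point: on ccNUMA nodes, memory pages are placed on the socket of the thread
 that writes them first, so the arrays should be initialized with the same OpenMP loop distribution
 as the compute loops. A minimal sketch, reusing the macros and placeholder names of the sweep
 sketch above (the initial values are placeholders, not the ones used in jacobi.c):

     /* NUMA-aware first-touch initialization: each thread touches "its" rows first, */
     /* using the same loop distribution as the later compute loops.                 */
     #pragma omp parallel for
     for (int j = 0; j < jmax; j++) {
         for (int i = 0; i < imax; i++) {
             U(j, i)    = 0.0;   /* placeholder initial values */
             UOLD(j, i) = 0.0;
             F(j, i)    = 0.0;
         }
     }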


——— END OF BASIC EXERCISE ——— if there is still time, you might want to continue with ADVANCED (see below) …



2.  INTERLUDE: ROOFLINE MODEL AND LIGHT-SPEED PERFORMANCE


ROOFLINE:

       Performance  [GFlop/s]   vs.   Computational/Arithmetic Intensity  [Flop/Byte]

       2 Hardware limitations: either PEAK Flops or PEAK Memory bandwidth

       Performance P = min (P_max, b_mem / B_c )


Maximum Performance of a COMPUTE BOUND CODE = P_max

       Theoretical value can be computed from the hardware specification
       (a more realistic value could be obtained by running LINPACK):

       Theoretical P_max =  [1|2]   number of sockets
                         *   2.6    core frequency [GHz]
                         *   8      cores per socket
                         *   8      single-precision operands per SIMD register
                         *   1      port for floating-point operations

       P_max (1 socket)            = 166.4 GFlop/s
       P_max (2 sockets = 1 node)  = 332.8 GFlop/s

Maximum possible Performance of a  MEMORY BOUND CODE = P_bw = b_mem / B_c


Memory bandwidth = b_mem:

       Theoretical value can be computed from the hardware specification:

       Theoretical b_max =  [1|2]   number of sockets
                         *   (...)  memory frequency [Gcycle/s]
                         *   8      bytes per channel [Byte/channel/cycle]
                         *   4      memory channels [channel]

       b_max(1.866 Gcycle/s) = 59.7 GB/s   \
       b_max(1.600 Gcycle/s) = 51.2 GB/s    |   depends on what kind of memory is installed in the nodes
       b_max(1.333 Gcycle/s) = 42.6 GB/s   /


        The STREAM benchmark provides a more realistic value:

        STREAM benchmark = Sustainable Memory Bandwidth in High Performance Computers

        Download from: https://www.cs.virginia.edu/stream/FTP/Code/stream.c

        see, e.g., ../stream/

        compile with: icc -Ofast -fno-alias -qopenmp -ffreestanding -qopt-streaming-stores never stream.c

        make sure to properly bind threads to 1 socket (or 2 sockets = 1 node)

        run without NT stores (-qopt-streaming-stores never), but apply the correction factors
        (3/2 for Copy/Scale, 4/3 for Add/Triad) to account for the write-allocate transfers that STREAM does not count


        STREAM benchmark on vsc3 (best out of 10 runs, without NT stores):

        Function            mem_0064      mem_0128
        Copy   * 3/2  =     42.4 GB/s     46.8 GB/s
        Scale  * 3/2  =     41.1 GB/s     45.9 GB/s
        Add    * 4/3  =     41.6 GB/s     46.1 GB/s
        Triad  * 4/3  =     42.4 GB/s     46.8 GB/s

        b_mem         =     42.5 GB/s     46.8 GB/s


P_bw = b_mem / B_c = Memory bandwidth / Code balance

             Code balance = B_c = Data traffic / Floating point operations  [Byte/Flop]

             ( Machine balance = B_m = b_max / P_max )

             Computational/Arithmetic Intensity = 1 / B_c  [Flop/Byte]


==> let's count per LUP (lattice update):

       FLOATING POINT OPERATIONS   (jacobi.c: simply count all + - * /)

              fLRes = ( ax * (UOLD(j, i-1) + UOLD(j, i+1))
                      + ay * (UOLD(j-1, i) + UOLD(j+1, i))
                      +  b * UOLD(j, i) - F(j, i)) / b;
              U(j, i) = UOLD(j, i) - data->fRelax * fLRes;
              residual += fLRes * fLRes;

       FLOATING POINT OPERATIONS = 13 Flop


       DATA TRAFFIC   (8 Byte per double; "read before write" = write-allocate transfer)

              update (see code above):        read UOLD(j+1, i)              [the other UOLD values are still in cache]
                                              read F(j, i)
                                              read before + write U(j, i)
                                                                             =>  4 x 8 Byte = 32 Byte

              copy   (UOLD(j, i) = U(j, i)):  read U(j, i)
                                              read before + write UOLD(j, i)
                                                                             =>  3 x 8 Byte = 24 Byte

       DATA TRAFFIC = 56 Byte


P_bw_jacobi_orig = b_mem / B_c = b_mem * Floating point operations / Data traffic

vsc3 - mem_0064 - 1 socket: =   9.9 GFlop/s = 42.5 GByte/s * 13 Flop / 56 Byte

vsc3 - mem_0128 - 1 socket: = 10.9 GFlop/s = 46.8 GByte/s * 13 Flop / 56 Byte


P_bw_jacobi_advanced (with the data traffic reduced to 32 Byte per LUP):

vsc3 - mem_0064 - 1 socket: = 17.3 GFlop/s = 42.5 GByte/s * 13 Flop / 32 Byte

vsc3 - mem_0128 - 1 socket: = 19.0 GFlop/s = 46.8 GByte/s * 13 Flop / 32 Byte


==> This is the light-speed performance of jacobi on 1 socket of vsc3 !
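
Just to make the min() in the roofline formula explicit, here is a tiny calculation with the
1-socket numbers for a vsc3 mem_0064 node as derived above (P_max, b_mem and B_c are the values
from this section):

     #include <stdio.h>

     /* Roofline estimate P = min(P_max, b_mem / B_c) for the original jacobi kernel. */
     int main(void)
     {
         double P_max = 166.4;        /* GFlop/s, theoretical peak of 1 socket   */
         double b_mem = 42.5;         /* GB/s, STREAM value for mem_0064 nodes   */
         double B_c   = 56.0 / 13.0;  /* Byte/Flop, original kernel (56 B/LUP)   */

         double P = (P_max < b_mem / B_c) ? P_max : b_mem / B_c;
         printf("P = min(%.1f, %.1f) = %.1f GFlop/s  -> memory bound\n",
                P_max, b_mem / B_c, P);
         return 0;
     }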




3.  ADVANCED EXERCISE:


Think about how you could improve the performance of the code...


Have a look at how the halo communication is done.
             ==> halo communication is overlapped with inner update (U_old = U)

Why is non-blocking MPI communication used here?

Does it make sense to overlap halo communication with inner update?

If not, do you still need the non-blocking version of the MPI routines? Why? Why not?
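
For reference, the generic pattern behind these questions looks roughly as sketched below: a 1D
decomposition where each rank exchanges one row with its upper and lower neighbour using
non-blocking MPI and does halo-independent work before waiting. All names (uold, nrow, ncol, up,
down, comm) are illustrative and not the actual jacobi code.

     /* Non-blocking halo exchange for a 1D row decomposition. uold holds nrow x ncol */
     /* doubles, rows 0 and nrow-1 are the halo rows, up/down are the neighbour ranks */
     /* (MPI_PROC_NULL at the outer domain boundary).                                 */
     MPI_Request req[4];

     MPI_Irecv(&uold[0],                 ncol, MPI_DOUBLE, up,   0, comm, &req[0]); /* halo from above */
     MPI_Irecv(&uold[(nrow - 1) * ncol], ncol, MPI_DOUBLE, down, 1, comm, &req[1]); /* halo from below */
     MPI_Isend(&uold[1 * ncol],          ncol, MPI_DOUBLE, up,   1, comm, &req[2]); /* first inner row */
     MPI_Isend(&uold[(nrow - 2) * ncol], ncol, MPI_DOUBLE, down, 0, comm, &req[3]); /* last inner row  */

     /* ... work that does not need the halo rows, e.g. the inner part of UOLD = U ... */

     MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   /* halo rows are valid from here on */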


Is there a way to increase the computational intensity of the code (reduce data traffic)?
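
One common way to get from 56 Byte down to the 32 Byte per LUP assumed for P_bw_jacobi_advanced
above is to eliminate the copy loop (UOLD = U) entirely, for example by swapping the roles of the
two arrays after each iteration. A minimal sketch, assuming u and uold are plain pointers (how this
interacts with the halo exchange and the convergence check still has to be worked out in the real
code):

     /* Avoid the UOLD = U copy (3 x 8 Byte per LUP) by swapping pointers:    */
     /* after the sweep that reads uold and writes u, the arrays swap roles.  */
     double *tmp = uold;
     uold = u;
     u    = tmp;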


RECAP - you might want to look into the slides or documentation:

  →  Memory placement - First touch! (→ see also slides #-#)
  →  PINNING = end of the previous exercises ( → see also slides #-#)

 → Documentation:         OpenMP         LIKWID         MPI



——— END OF ADVANCED EXERCISE ——— next topic will be worked on tomorrow …



4.  EXERCISE ON OVERLAPPING COMMUNICATION AND COMPUTATION:


Finally, let's overlap communication and computation...


1. substitute the omp for with a taskloop (see the sketch after item 2):

   parallel{ single{ taskloop{<compute stencil and update>}}}

   --> this allows you to see the overhead of taskloop

   --> maybe you need a larger problem size to work on (input)

   --> grainsize might help...


2. overlapping communication and computation:

   parallel{ single{ task{halo exchange + halo rows} taskloop{internal computation} }}
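
A rough sketch of both steps is given below. It is C-style OpenMP close to, but not identical with,
the solution code: update_row() and halo_exchange() are placeholder functions, the grainsize value
is just a starting point for tuning, and MPI calls inside a task require an appropriate MPI thread
support level (e.g. MPI_THREAD_MULTIPLE).

     /* Step 1: taskloop instead of "omp for" (measures the taskloop overhead). */
     #pragma omp parallel
     #pragma omp single
     {
         #pragma omp taskloop grainsize(32)     /* tune grainsize and/or problem size */
         for (int j = 1; j < jmax - 1; j++)
             update_row(j);                     /* placeholder: stencil + update of row j */
     }

     /* Step 2: one task does the halo exchange plus the two halo-dependent rows, */
     /* while the taskloop updates the interior rows that do not need the halos.  */
     #pragma omp parallel
     #pragma omp single
     {
         #pragma omp task                       /* communication task */
         {
             halo_exchange();                   /* placeholder: MPI halo exchange */
             update_row(1);                     /* rows next to the halo rows */
             update_row(jmax - 2);
         }

         #pragma omp taskloop grainsize(32)     /* interior rows, independent of the halos */
         for (int j = 2; j < jmax - 2; j++)
             update_row(j);

         /* the barrier at the end of single/parallel waits for all tasks */
     }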



RECAP - you might want to look into the slides or documentation:

  →  Overlapping communication and computation  (→ see also slides #, #, #-# (taskloop))

  →  Memory placement - First touch! (→ see also slides #-#)
  →  PINNING = end of the previous exercises  (→ see also slides #-#)

 → Documentation:         OpenMP         LIKWID         MPI


——— END OF OVERLAPPING EXERCISE ———
Congratulations, you have successfully finished all Jacobi exercises
                             on hybrid programming through OpenMP parallelization !
→ You are now ready to work on (and improve the performance of) your own codes !
But first:  ENJOY THE COFFEE BREAK !!!





