MPI+OpenMP: he-hy - Hello Hybrid! - compiling, starting, pinning


Prepare for these Exercises:

cd ~/HY-VSC/he-hy     #   change into your he-hy directory


Contents:

README_he-hy.sh           #   README file (very long --> follow the Moodle) - do NOT run as script... - .sh mainly for colors

job_*.sh                  #   job-scripts to run the provided tools, 2 x job_*_exercise.sh

*.[c|f90]                 #   various codes (hello world & tests) - NO need to look into these!

vsc3/vsc3_slurm.out_*     #   vsc3 output files - sorted and with comments & debugging info



IN THE ONLINE COURSE this exercise shall be done in two parts:

    first exercise        =   1. + 2. + 4.          (skipping 3. for now)

    second exercise   =   5. + 3. + 6. + 7.   (after the talk on hardware and pinning)



    ! you have to check all hardware partitions you would like to use separately !


1. FIRST THINGS FIRST - PART 1:   find out about a (new) cluster - login node

    module (avail, load, list, unload); compiler (name & --version)

    Always check compiler names and versions, e.g. with: mpiicc --version !!!
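    A minimal sketch of this check on the login node (the module names below are only placeholders - take the real ones from 'module avail'):

        module avail                      #   what is installed on this cluster?
        module load intel intel-mpi       #   placeholder names - load a compiler + an MPI module
        module list                       #   what is loaded right now?
        mpiicc   --version                #   C compiler behind the Intel MPI wrapper
        mpiifort --version                #   Fortran compiler behind the Intel MPI wrapper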


2. FIRST THINGS FIRST - PART 2:   find out about a (new) cluster - batch jobs

    job environment, job scripts (clean) & batch system (SLURM); test compiler and MPI version

    job_env.sh,  job_te-ve_[c|f].sh,  te-ve*


SLURM (vsc3):

sbatch job*.sh                             #   submit

sq                                         #   check

scancel JOB_ID                             #   cancel

output will be written to: slurm-*.out     #   output
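
A minimal job-script sketch for such a test (partition/qos lines and module names are placeholders - copy the real settings from the provided job_*.sh files):

    #!/bin/bash
    #SBATCH -J te-ve                     #   job name
    #SBATCH -N 1                         #   1 node is enough for this test
    #SBATCH --time=00:05:00              #   short test job
    ##SBATCH -p ... -q ...               #   placeholder - partition/qos as in the provided scripts

    module purge
    module load intel intel-mpi          #   placeholder module names

    mpiicc --version                     #   which compiler does the wrapper use?
    mpirun -n 4 hostname                 #   does MPI startup work on the compute node?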


3. --> skip this point in the first exercise (and come back later...)
3. MPI - pure MPI:      compile and run the MPI "Hello world!" program (pinning)

    job_he-mpi_[default|ordered].sh,  he-mpi.[c|f90],  help_fortran_find_core_id.c

    ? Why is the output (most of the time) unsorted ? ==> here (he-mpi) you can use: ... | sort -n

    ? Can you rely on the defaults for pinning ? ==> Always take care of correct pinning yourself !
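
    A hedged sketch of compiling and running it by hand (the executable name chosen here is arbitrary):

        mpiicc -o he-mpi he-mpi.c             #   pure MPI - no OpenMP flag needed
        mpirun -n 4 ./he-mpi | sort -n        #   sort -n puts the per-rank output in order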


4. MPI+OpenMP: :TODO: how to compile and start an application
                                       how to do conditional compilation

    job_co-co_[c|f].sh,  co-co.[c|f90]

    Recap with Intel compiler & Intel MPI (→ see also slides #-#):

                        compiler          ? USE_MPI    ? _OPENMP    START APPLICATION:

C:                                                                  export OMP_NUM_THREADS=#
        with MPI        mpiicc            -DUSE_MPI    -qopenmp     mpirun -n # ./<exe>
        no MPI          icc                            -qopenmp     ./<exe>

Fortran:                                                            export OMP_NUM_THREADS=#
        with MPI        mpiifort -fpp     -DUSE_MPI    -qopenmp     mpirun -n # ./<exe>
        no MPI          ifort    -fpp                  -qopenmp     ./<exe>


      TODO:

    → Compile and Run (4 possibilities) co-co.[c|f90] = Demo for conditional compilation (the combinations are sketched below this TODO).

    → Do it by hand - compile and run it directly on the login node.

    → Have a look into the code:  co-co.[c|f90]  to see how it works.

    → It's also available as a script:  job_co-co_[c|f].sh
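
    A sketch of the 4 possibilities by hand on the login node, following the recap table above (executable names and the thread/process counts are arbitrary):

        export OMP_NUM_THREADS=4

        # C - with MPI:
        mpiicc -DUSE_MPI -qopenmp -o co-co_mpi co-co.c
        mpirun -n 2 ./co-co_mpi

        # C - no MPI:
        icc -qopenmp -o co-co_omp co-co.c
        ./co-co_omp

        # Fortran - with MPI:
        mpiifort -fpp -DUSE_MPI -qopenmp -o co-co_mpi_f co-co.f90
        mpirun -n 2 ./co-co_mpi_f

        # Fortran - no MPI:
        ifort -fpp -qopenmp -o co-co_omp_f co-co.f90
        ./co-co_omp_f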


STOP HERE ------- THIS IS THE END OF THE first exercise ------- STOP HERE


5. MPI+OpenMP: :TODO: get to know the hardware - needed for pinning
                                                                              (→ see also slides #-#)

      TODO:

    → Find out about the hardware of compute nodes:

    → Write and Submit: job_check-hw_exercise.sh (typical commands are sketched below this TODO)

    → Describe the compute nodes... (core numbering?)

    → solution = job_check-hw_solution.sh

    → solution.out =  vsc3_slurm.out_check-hw_solution
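
    Typical commands for such a check inside the job script (a hedged sketch - the provided solution may use different tools):

        hostname                 #   which node did the job actually run on?
        lscpu                    #   sockets, cores per socket, threads per core, NUMA nodes
        numactl --hardware       #   which core ids belong to which NUMA node (if numactl is installed)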


      TODO:

    → Find out about the hardware of the vsc3plus nodes:

    → Write a job_topo3+.sh and Submit it with sbvsc3plus (alias) to run on the new VSC3+ nodes.

    → Describe the vsc3plus nodes... (what are these, how many sockets, cores, hyperthreads, core numbering?)

    → no solution available


3. --> do 3. now --> MPI - pure MPI (pinning)


6. MPI+OpenMP: :TODO: compile and run the Hybrid "Hello world!" program

    job_he-hy_exercise.sh,  he-hy.[c|f90],  help_fortran_find_core_id.c

      TODO:

    → Run he-hy on a compute node, i.e.:  sbatch job_he-hy_exercise.sh

    → Find out what the default pinning with mpirun is! (one way to check is sketched below)

    → Look into: job_he-hy_exercise.sh 

    → Do NOT YET do the pinning exercise, see 7. below.


    ? Why is the output (most of the time) unsorted ? ==> here (he-hy) you can use: ... | sort -n

    ? Can you rely on the defaults for pinning ? ==> Always take care of correct pinning yourself !
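
    One way to check the default pinning (a sketch - add this to the job script; the executable name he-hy is assumed):

        export I_MPI_DEBUG=4            #   Intel MPI prints its process pinning map
        export KMP_AFFINITY=verbose     #   only adds output - the affinity type stays at its default
        mpirun -n 4 ./he-hy | sort -n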


7. MPI+OpenMP: :TODO: how to do pinning

    job_he-hy_[exercise|solution].sh,  he-hy.[c|f90]


      TODO (see below for info):

    → Do the pinning exercise in:  job_he-hy_exercise.sh

    → one possible solution = job_he-hy_solution.sh


PINNING: (→ see also slides #-#)


Pinning depends on:

               batch system   [SLURM*]         \
               MPI library    [Intel*]          |    interaction between these !
               startup        [mpirun*|srun]   /

Always check your pinning !

          → job_he-hy...sh (he-hy.[c|f90] prints core_id)

          → print core_id in your application (see he-hy.*)

          → turn on debugging info & verbose output in job

          → monitor your job → login to nodes: top [1 q]


Intel → PINNING is done via environment variables (valid for Intel-only!):


pure MPI:           I_MPI_PIN_PROCESSOR_LIST=<proclist>   (other possibilities: see the web)


MPI+OpenMP:   I_MPI_PIN_DOMAIN (3 options) + KMP_AFFINITY

                            I_MPI_PIN_DOMAIN=core|socket|numa|node|cache|...

                            I_MPI_PIN_DOMAIN=omp|<n>|auto[:compact|scatter|platform]
                                                               omp - number of logical cores = OMP_NUM_THREADS

                            I_MPI_PIN_DOMAIN=[m_1,.....m_n] hexadecimal bit mask, [] included!
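
                            for example (a hedged illustration, assuming 16 logical cores numbered 0-15):

                                export I_MPI_PIN_DOMAIN="[0xff,0xff00]"   #   domain 1 = cores 0-7, domain 2 = cores 8-15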


OpenMP:            KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

                        modifier                                  type (required)

                        granularity=fine|thread|core|tile         compact
                        proclist={<proc-list>}                    balanced
                        [no]respect (an OS affinity mask)         scatter
                        [no]verbose                               explicit (no permute,offset)
                        [no]warnings                              disabled (no permute,offset)
                                                                  none     (no permute,offset)

                        default: noverbose,respect,granularity=core,none[,0,0]


Debug:               KMP_AFFINITY=verbose
                           I_MPI_DEBUG=4


Example:         1 MPI process per socket, 8 cores per socket, 2 sockets per node:

                         export OMP_NUM_THREADS=8

                         export KMP_AFFINITY=scatter

                         export I_MPI_PIN_DOMAIN=socket

                         mpirun -ppn 2 -np # ./<exe>

                                                           see:  job_he-hy_test-1ps.sh + slurm.out_he-hy_mpirun_test-1ps
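
A possible job-script sketch for this 1-process-per-socket example (SLURM header and module names are placeholders - compare with job_he-hy_solution.sh):

    #!/bin/bash
    #SBATCH -J he-hy
    #SBATCH -N 1                         #   1 node = 2 sockets x 8 cores in this example

    module purge
    module load intel intel-mpi          #   placeholder module names

    export OMP_NUM_THREADS=8             #   8 threads per MPI process = 1 full socket
    export I_MPI_PIN_DOMAIN=socket       #   each MPI process gets one socket
    export KMP_AFFINITY=scatter          #   spread the threads over the cores of that socket
    export I_MPI_DEBUG=4                 #   print the pinning map to check it

    mpirun -ppn 2 -np 2 ./he-hy | sort -n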




