Topic outline

  • General

    This course teaches performance engineering approaches at the compute node level. “Performance engineering” as we define it is more than employing tools to identify hotspots and bottlenecks. It is about developing a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code that does the actual computational work is executed. Once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of optimizations can often be predicted. We introduce a “holistic” node-level performance engineering strategy centered around the Roofline performance model and apply it to different algorithms from computational science. We also show that simple, easy-to-use tools go a long way towards deep insight into the interaction between software and hardware.

    Lecturers: Dr. Georg Hager and Prof. Gerhard Wellein

    Agenda:

    Introduction
    
      Our approach to performance engineering
      Basic architecture of multicore systems: threads, cores, caches, sockets, memory
      The important role of system topology
    
    Tools: topology & affinity in multicore environments
    
      Overview
      likwid-topology and likwid-pin
    
    Microbenchmarking for architectural exploration
    
      Properties of data paths in the memory hierarchy
      Bottlenecks
      OpenMP barrier overhead
    
    Roofline model: basics
    
      Model assumptions and construction
      Simple examples
      Limitations of the Roofline model
    
    Tools: hardware performance counters
    
      Why hardware performance counters?
      likwid-perfctr
      Case study: Detecting load imbalance
    
    Roofline case studies
    
      Dense matrix-vector multiplication
      Sparse matrix-vector multiplication
      Jacobi (stencil) smoother
    
    Optimal use of parallel resources
    
      Single Instruction Multiple Data (SIMD)
      Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
      Simultaneous Multi-Threading (SMT)
    
    Extending Roofline: The ECM performance model

  • Supplementary material

      For publications of the RRZE HPC group, see http://blogs.fau.de/hager/publications

      Important links:

      LIKWID tool suite: https://github.com/RRZE-HPC/likwid

      LIKWID documentation Wiki: http://tiny.cc/LIKWID

      Kerncraft automatic Roofline/ECM modeling tool: https://github.com/RRZE-HPC/kerncraft

      GHOST sparse building blocks library: http://tiny.cc/GHOST

      PHIST, a Pipelined Hybrid Parallel Iterative Solver Toolkit: https://bitbucket.org/essex/phist


      STREAM source code: stream.c

      Compile with, e.g.:

      icc -Ofast -xHost -qopenmp -fno-alias -nolib-inline -qopt-streaming-stores <never|always> -o stream.exe stream.c

      Run with:

      likwid-pin -c <pin_mask> ./stream.exe
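
      The streaming-store option above matters for how the reported bandwidth should be read. The following is a minimal sketch in the spirit of the STREAM triad kernel, not the contents of stream.c; the function names are hypothetical. STREAM counts 24 bytes per triad iteration (two loads, one store). If the store goes through the cache with a write-allocate transfer, as is typically the case without streaming stores, the actual memory traffic is 32 bytes per iteration.

```c
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
 * Hypothetical helper for illustration; the real benchmark is stream.c. */
void stream_triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

/* Bandwidth in GB/s for one triad sweep over n elements.
 * count_write_allocate == 0: 24 bytes/iteration (what STREAM reports).
 * count_write_allocate != 0: 32 bytes/iteration (adds the write-allocate
 * load of a[i], assumed to occur when no streaming stores are used). */
double triad_bandwidth_gbs(size_t n, double seconds, int count_write_allocate)
{
    double bytes_per_iter = count_write_allocate ? 32.0 : 24.0;
    return bytes_per_iter * (double)n / seconds * 1.0e-9;
}
```

      Under the write-allocate assumption, the actual memory bandwidth is 4/3 of the number STREAM reports for the triad.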

      LIKWID-instrumented STREAM source code: stream-mapi.c

      Compile with, e.g.:

      icc <options-from-above> -DLIKWID_PERFMON -I<path_to_likwid_inc> stream-mapi.c -o stream-mapi.exe -L<path_to_likwid_lib> -llikwid

      Run with:

      likwid-perfctr -C <pin_mask> -m -g <perf_group> ./stream-mapi.exe
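
      For orientation, this is roughly what marker instrumentation looks like. It is a sketch under assumptions, not an excerpt from stream-mapi.c: the function name and the region name "triad" are made up, and depending on the LIKWID version the macros are provided by likwid.h or likwid-marker.h. The fallback definitions let the same source build without LIKWID when -DLIKWID_PERFMON is omitted.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>   /* newer LIKWID versions may need <likwid-marker.h> */
#else
/* Without -DLIKWID_PERFMON the markers expand to nothing. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical example: one triad sweep wrapped in a marker region.
 * INIT and CLOSE must run once per program; in a real code they
 * usually sit at the start and end of main(). */
double marked_triad(double *a, const double *b, const double *c,
                    double s, size_t n)
{
    LIKWID_MARKER_INIT;
    #pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;
        LIKWID_MARKER_START("triad");
        #pragma omp for schedule(static)
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
        LIKWID_MARKER_STOP("triad");
    }
    LIKWID_MARKER_CLOSE;
    return a[n / 2];   /* use the result so the loop is not optimized away */
}
```

      When run under likwid-perfctr with -m, counts are reported per marker region instead of for the whole program run.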

      Vector Triad throughput benchmark: triad-throughput.tar.gz

      Compile with:

      icc -c timing.c
      icc -c dummy.c
      ifort -Ofast -xHost -qopenmp -fno-alias -fno-inline triad-tp.f90 dummy.o timing.o -o triad.exe

      Run with:

      echo <size> | likwid-pin <PIN_OPTIONS> ./triad.exe
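
      The build lines above compile dummy.c and timing.c separately. A plausible reason, sketched below in C, is the usual microbenchmark structure: the kernel is repeated many times, and a call to an external function behind a condition that is never true at run time keeps the compiler from collapsing the repetitions. This is an illustration only, not the contents of triad-tp.f90; all names are hypothetical, and here dummy() sits in the same file for brevity, which a real benchmark should avoid.

```c
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds via C11 timespec_get (assumption: the
 * archive's timing.c serves the same purpose with its own method). */
static double wall_time(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Stand-in for a separately compiled dummy routine. In a real benchmark
 * it lives in its own object file so the compiler cannot see its body. */
void dummy(double *a, double *b, double *c, double *d)
{
    (void)a; (void)b; (void)c; (void)d;
}

/* Repeat the vector triad a[i] = b[i] + c[i] * d[i] and return MFlop/s
 * (2 flops per iteration). */
double vector_triad_mflops(double *a, double *b, double *c, double *d,
                           size_t n, long repeats)
{
    double start = wall_time();
    for (long r = 0; r < repeats; ++r) {
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + c[i] * d[i];
        /* Never true for positive input data, but not provable at
         * compile time when dummy() is external. */
        if (a[n / 2] < 0.0)
            dummy(a, b, c, d);
    }
    double elapsed = wall_time() - start;
    return 2.0 * (double)n * (double)repeats / elapsed * 1.0e-6;
}
```

      In practice the repetition count is chosen large enough that the measured time is well above the timer resolution.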


      Dense matrix-vector multiplication code (with LIKWID markers): dmvm-plain.tar.gz

      Build with:

      icc -c timing.c
      ifort -Ofast -xHost -qopenmp -I<path_to_likwid_inc> dmvm.f90 timing.o -L<path_to_likwid_lib> -llikwid -o dmvm-plain.exe

      Run with:

      echo <NUM_ROWS> <TOTAL_MATRIX_ELEMENTS> | likwid-perfctr -g <METRIC_GROUP> -C <PIN_EXPR> -m ./dmvm-plain.exe
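
      As a rough guide to what to expect from this benchmark, the sketch below pairs a plain dense matrix-vector kernel with the basic Roofline estimate P = min(P_peak, I * b_S). It is written in C for illustration and is not the contents of dmvm.f90; the function names are hypothetical. It assumes double precision and a matrix too large for the caches while the vectors stay cached, so each matrix element (8 bytes) is loaded once for 2 flops, giving I = 0.25 flop/byte.

```c
#include <stddef.h>

/* y += A * x with A stored column-major (rows x cols), matching the
 * Fortran layout a(r,c). The inner loop runs over rows (stride 1). */
void dmvm(double *y, const double *a, const double *x,
          size_t rows, size_t cols)
{
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            y[r] += a[c * rows + r] * x[c];
}

/* Basic Roofline estimate in GFlop/s:
 * min(peak performance, intensity * memory bandwidth). */
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double intensity_flop_per_byte)
{
    double memory_bound = intensity_flop_per_byte * bandwidth_gbs;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

      With made-up machine numbers of 500 GFlop/s peak and 40 GB/s memory bandwidth, roofline_gflops(500.0, 40.0, 0.25) gives 10 GFlop/s, i.e. the kernel would be memory bound under these assumptions.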

      Jacobi 3D stencil code: j3d_with_likwid.tar_.gz

      Build with the supplied Makefile (you may need to adapt it to your LIKWID setup).

      Run with:

      likwid-perfctr -C <pin_mask> -m -g <perf_group> ./J3D.exe <size>
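
      For readers who have not seen the kernel, here is a minimal C sketch of one Jacobi sweep on a cubic grid. It assumes a 7-point stencil that averages the six nearest neighbors on an n x n x n grid with fixed boundary layers; the code in the archive may differ in stencil, data layout, and loop structure. All names are hypothetical.

```c
#include <stddef.h>

/* Linear index of grid point (i,j,k), i running fastest. */
static size_t idx(size_t i, size_t j, size_t k, size_t n)
{
    return (k * n + j) * n + i;
}

/* One Jacobi sweep: every interior point of dst becomes the average
 * of its six nearest neighbors in src. Boundary points are untouched. */
void jacobi3d_sweep(double *dst, const double *src, size_t n)
{
    const double sixth = 1.0 / 6.0;
    #pragma omp parallel for schedule(static)
    for (size_t k = 1; k + 1 < n; ++k)
        for (size_t j = 1; j + 1 < n; ++j)
            for (size_t i = 1; i + 1 < n; ++i)
                dst[idx(i, j, k, n)] = sixth *
                    (src[idx(i - 1, j, k, n)] + src[idx(i + 1, j, k, n)] +
                     src[idx(i, j - 1, k, n)] + src[idx(i, j + 1, k, n)] +
                     src[idx(i, j, k - 1, n)] + src[idx(i, j, k + 1, n)]);
}

/* Performance in million lattice site updates per second (MLUP/s),
 * counting interior points only. */
double mlups(size_t n, long sweeps, double seconds)
{
    double inner = (double)(n - 2);
    return inner * inner * inner * (double)sweeps / seconds * 1.0e-6;
}
```

      Stencil performance is commonly reported in lattice site updates per second rather than flops, because the flop count per update depends on how the arithmetic is written.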


      Sparse matrix benchmark code (CSR/SELL-C-sigma): sparsematrixbench.tar.gz
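
      As background for the CSR part of this benchmark, the following is a minimal C sketch of sparse matrix-vector multiplication in CSR (compressed sparse row) format. It is not taken from the archive; the function and array names are hypothetical, and the SELL-C-sigma variant is not shown.

```c
#include <stddef.h>

/* y = A * x for a matrix in CSR format.
 * val:     nonzero values, stored row by row
 * col_idx: column index of each nonzero
 * row_ptr: row r occupies val[row_ptr[r]] .. val[row_ptr[r+1] - 1]
 *          (n_rows + 1 entries) */
void spmv_csr(double *y, const double *val, const int *col_idx,
              const int *row_ptr, const double *x, int n_rows)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < n_rows; ++r) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* indirect access to x */
        y[r] = sum;
    }
}
```

      The indirect access to x through col_idx is what makes the data traffic, and therefore a Roofline estimate, depend on the sparsity pattern of the matrix.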