Topic outline

  • General

    This course teaches performance engineering on the compute-node level. “Performance engineering” as we define it is more than employing tools to identify hotspots and bottlenecks: it is about developing a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code that does the actual computational work is executed. Once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of optimizations can often be predicted. We introduce a “holistic” node-level performance engineering strategy centered around the Roofline performance model and apply it to different algorithms from computational science. We also show that simple, easy-to-use tools take us a long way towards deep insight into the interaction between software and hardware.
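
    As a rough illustration of how such a prediction can work, the basic Roofline model bounds performance by the smaller of the peak compute rate and the product of computational intensity and memory bandwidth. The following Python sketch is only an illustration: the function name `roofline_gflops` is hypothetical, and the machine numbers are made-up placeholders rather than measurements of any real system.

```python
def roofline_gflops(peak_gflops, mem_bw_gbytes, intensity_flop_per_byte):
    """Upper performance bound in GFlop/s from the basic Roofline model."""
    if min(peak_gflops, mem_bw_gbytes, intensity_flop_per_byte) <= 0:
        raise ValueError("all inputs must be positive")
    # Performance is limited either by the cores or by the data transfer.
    return min(peak_gflops, intensity_flop_per_byte * mem_bw_gbytes)


# Hypothetical machine (assumed values, not measured):
PEAK_GFLOPS = 1000.0     # peak floating-point performance
MEM_BW_GBYTES = 100.0    # attainable memory bandwidth

# Vector triad a[i] = b[i] + c[i] * d[i] in double precision:
# 2 flops per iteration; 3 loads + 1 store + 1 write-allocate = 40 bytes.
triad_intensity = 2 / 40

print(roofline_gflops(PEAK_GFLOPS, MEM_BW_GBYTES, triad_intensity))
```

    With these assumed numbers the triad is clearly limited by memory bandwidth, so an optimization that only speeds up in-core execution would not be expected to pay off.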

    Lecturers: Dr. Georg Hager and Prof. Gerhard Wellein


      Our approach to performance engineering
      Basic architecture of multicore systems: threads, cores, caches, sockets, memory
      The important role of system topology
    Tools: topology & affinity in multicore environments
      likwid-topology and likwid-pin
    Microbenchmarking for architectural exploration
      Properties of data paths in the memory hierarchy
      OpenMP barrier overhead
    Roofline model: basics
      Model assumptions and construction
      Simple examples
      Limitations of the Roofline model
    Tools: hardware performance counters
      Why hardware performance counters?
      Case study: Detecting load imbalance
    Roofline case studies
      Dense matrix-vector multiplication
      Sparse matrix-vector multiplication
      Jacobi (stencil) smoother
    Optimal use of parallel resources
      Single Instruction Multiple Data (SIMD)
      Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
      Simultaneous Multi-Threading (SMT)
    Extending Roofline: The ECM performance model
  • Supplementary material

      For publications of the RRZE HPC group, see

      Important links:

      LIKWID tool suite:

      LIKWID documentation Wiki:

      Kerncraft automatic Roofline/ECM modeling tool:

      GHOST sparse building blocks library:

      PHIST, a Pipelined Hybrid Parallel Iterative Solver Toolkit:

      STREAM source code: stream.c

      Compile with, e.g. (pick either never or always for -qopt-streaming-stores):

      icc -Ofast -xHost -qopenmp -fno-alias -nolib-inline -qopt-streaming-stores <never|always> -o stream.exe stream.c

      Run with:

      likwid-pin -c <pin_mask> ./stream.exe
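
      The streaming-stores option in the compile line matters when interpreting STREAM results. STREAM counts only the explicit loads and stores of each kernel, whereas a regular store to a cache line that is not in cache typically triggers an additional write-allocate transfer. The Python sketch below converts a reported bandwidth into an estimate of the actual memory traffic. It assumes double-precision arrays much larger than the caches, that every regular store causes a full write-allocate, and that streaming stores avoid it completely; the function name and the sample numbers are hypothetical.

```python
# Bytes per loop iteration (double precision) that STREAM counts when it
# reports a bandwidth number for each kernel.
COUNTED_BYTES = {"copy": 16, "scale": 16, "add": 24, "triad": 24}

# Assumed extra traffic per iteration without streaming stores:
# one 8-byte write-allocate load for the single stored element.
WRITE_ALLOCATE_BYTES = 8


def actual_bandwidth(kernel, reported_mbytes_per_s, streaming_stores):
    """Estimate the real memory bandwidth behind a reported STREAM number."""
    counted = COUNTED_BYTES[kernel]
    if streaming_stores:
        # Assumption: streaming stores bypass the write-allocate entirely.
        return reported_mbytes_per_s
    return reported_mbytes_per_s * (counted + WRITE_ALLOCATE_BYTES) / counted


# Made-up reported value of 30000 MByte/s for the triad kernel:
print(actual_bandwidth("triad", 30000.0, streaming_stores=False))
print(actual_bandwidth("triad", 30000.0, streaming_stores=True))
```

      Under these assumptions the correction factor is 4/3 for add and triad and 3/2 for copy and scale. Some processors can avoid write-allocates without explicit streaming stores, so measured traffic from hardware performance counters remains the more reliable check.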

      LIKWID-instrumented STREAM source code: stream-mapi.c

      Compile with, e.g.:

      icc <options-from-above> -DLIKWID_PERFMON -I<path_to_likwid_inc> stream-mapi.c -o stream-mapi.exe -L<path_to_likwid_lib> -llikwid

      Run with:

      likwid-perfctr -C <pin_mask> -m -g <perf_group> ./stream-mapi.exe

      Vector Triad throughput benchmark: triad-throughput.tar.gz

      Compile with:

      icc -c timing.c
      icc -c dummy.c
      ifort -Ofast -xHost -qopenmp -fno-alias -fno-inline triad-tp.f90 dummy.o timing.o -o triad.exe

      Run with:

      echo <size> | likwid-pin <PIN_OPTIONS> ./triad.exe

      Dense matrix-vector multiplication code (with LIKWID markers): dmvm-plain.tar.gz

      Build with:

      icc -c timing.c
      ifort -Ofast -xHost -qopenmp -I<path_to_likwid_inc> dmvm.f90 timing.o -L<path_to_likwid_lib> -llikwid -o dmvm-plain.exe

      Run with:

      echo <NUM_ROWS> <TOTAL_MATRIX_ELEMENTS> | likwid-perfctr -g <METRIC_GROUP> -C <PIN_EXPR> -m ./dmvm-plain.exe

      Jacobi 3D stencil code: j3d_with_likwid.tar_.gz

      Build with the supplied Makefile (you may need to adapt it to your LIKWID setup).

      Run with:

      likwid-perfctr -C <pin_mask> -m -g <perf_group> ./J3D.exe <size>

      Sparse matrix benchmark code (CSR/SELL-C-sigma): sparsematrixbench.tar.gz
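
      For readers unfamiliar with the CSR (compressed sparse row) format named above, the following Python sketch shows the storage scheme and the resulting matrix-vector multiplication loop. It is a generic textbook formulation, not the code contained in the archive; the function name `spmv_csr` and the small example matrix are chosen purely for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A * x for a sparse matrix A stored in CSR format.

    row_ptr[r] .. row_ptr[r + 1] - 1 index the nonzeros of row r,
    col_idx[k] is the column of nonzero k, values[k] is its value.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access to x through the column index array.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example 3x3 matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

print(spmv_csr(row_ptr, col_idx, values, x))  # [7.0, 6.0, 17.0]
```

      In a compiled implementation with 8-byte values and 4-byte column indices, each nonzero contributes 2 flops but at least 12 bytes of matrix data, which is why this kernel is usually analyzed as memory bound.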