Topic outline

  • General

    Note: This page shows the content of the 2016 course at NASA Langley Research Center. It will be updated for the 2018 course in October.

    Even in scientific computing, code is often developed without a basic understanding of performance bottlenecks and the relevant optimization opportunities, and textbook code transformations are applied blindly without a clear goal in mind. This course teaches a structured, model-based performance engineering approach at the compute-node level. It aims at a deep understanding of how code performance comes about, which hardware bottlenecks apply, and how to work around them. The pivotal ingredient of this process is a model that links software requirements with hardware capabilities. Such models are often simple enough to be worked out with pencil and paper (the well-known Roofline model is one example), yet they lead to deep insights and strikingly accurate runtime predictions. The course starts with simple benchmark kernels and advances to various algorithms from computational science.
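
    To make the idea of linking software requirements with hardware capabilities concrete, here is a minimal sketch of the basic Roofline bound, P = min(P_peak, I * b_S). The machine numbers are hypothetical and chosen only for illustration; the function name is likewise an assumption and does not come from the course material. The kernel is the STREAM triad, counted with a write-allocate transfer on the store stream (i.e. without streaming stores).

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Upper performance bound in Gflop/s: min(P_peak, I * b_S)."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gbs)


# Hypothetical machine; both values are assumptions for illustration only.
PEAK_GFLOPS = 500.0     # assumed peak floating-point performance
BANDWIDTH_GBS = 100.0   # assumed saturated memory bandwidth

# STREAM triad a[i] = b[i] + s * c[i] in double precision:
# 2 flops per iteration; 3 * 8 bytes of explicit traffic plus 8 bytes
# for the write-allocate on a[] when streaming stores are not used.
triad_intensity = 2.0 / 32.0  # flop/byte

# Low intensity: the bandwidth term limits performance (memory bound).
print(roofline(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity))

# High intensity: the peak term limits performance (compute bound).
print(roofline(PEAK_GFLOPS, BANDWIDTH_GBS, 10.0))
```

    With streaming stores the write-allocate disappears and the intensity becomes 2/24 flop/byte, which raises the predicted bound accordingly.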

    Lecturers:

    Dr. Georg Hager & Prof. Dr. Gerhard Wellein

    Friedrich-Alexander University of Erlangen-Nuremberg, Germany, 
    and LIKWID High Performance Programming UG

    Course dates: November 7-9, 2018

    Course Schedule

    Introduction

    • Our approach to performance engineering
    • Basic architecture of multicore systems: threads, cores, caches, sockets, memory
    • The important role of system topology

    Tools: topology & affinity in multicore environments

    • Overview
    • likwid-topology and likwid-pin

    Microbenchmarking for architectural exploration

    • Properties of data paths in the memory hierarchy
    • Bottlenecks
    • OpenMP barrier overhead

    Roofline model: basics

    • Model assumptions and construction
    • Simple examples
    • Limitations of the Roofline model

    Tools: hardware performance counters

    • Why hardware performance counters?
    • likwid-perfctr
    • Validating performance models

    Roofline case studies

    • Dense matrix-vector multiplication
    • Sparse matrix-vector multiplication
    • Jacobi (stencil) smoother


    Optimal use of parallel resources

    • Single Instruction Multiple Data (SIMD)
    • Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
    • Simultaneous Multi-Threading (SMT)

    Extending Roofline: The ECM performance model

    Sparse matrix-vector multiplication reloaded: The SELL-C-sigma format
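
    As a rough illustration of the storage scheme named in this unit, the sketch below converts a CSR matrix to a SELL-C-sigma layout and multiplies it with a vector. It assumes the usual description of the format: rows are sorted by descending length within windows of sigma rows, grouped into chunks of C rows, padded to the longest row of each chunk, and stored column by column inside the chunk. The function names and the list-based layout are illustrative and are not taken from the course's benchmark code.

```python
def csr_to_sell(row_ptr, col_idx, values, C, sigma):
    """Convert CSR arrays to a SELL-C-sigma layout (illustrative sketch)."""
    n = len(row_ptr) - 1
    row_len = [row_ptr[i + 1] - row_ptr[i] for i in range(n)]

    # Sort rows by descending length, but only inside windows of sigma rows.
    perm = []
    for start in range(0, n, sigma):
        window = list(range(start, min(start + sigma, n)))
        window.sort(key=lambda r: -row_len[r])  # stable sort keeps ties in order
        perm.extend(window)

    chunk_ptr, chunk_len, sell_val, sell_col = [0], [], [], []
    for start in range(0, n, C):
        rows = perm[start:start + C]
        width = max(row_len[r] for r in rows)  # longest row in this chunk
        chunk_len.append(width)
        # Column-major storage inside the chunk: entry j of all C rows is contiguous.
        for j in range(width):
            for lane in range(C):
                if lane < len(rows) and j < row_len[rows[lane]]:
                    k = row_ptr[rows[lane]] + j
                    sell_val.append(values[k])
                    sell_col.append(col_idx[k])
                else:
                    # Padding entry: zero value, column 0 (contributes nothing).
                    sell_val.append(0.0)
                    sell_col.append(0)
        chunk_ptr.append(len(sell_val))
    return perm, chunk_ptr, chunk_len, sell_val, sell_col


def sell_spmv(perm, chunk_ptr, chunk_len, sell_val, sell_col, x, C):
    """Compute y = A * x from the SELL-C-sigma arrays; y is in original row order."""
    n = len(perm)
    y = [0.0] * n
    for c in range(len(chunk_len)):
        base = chunk_ptr[c]
        for j in range(chunk_len[c]):
            for lane in range(C):
                p = c * C + lane  # position in the permuted row order
                if p < n:
                    k = base + j * C + lane
                    y[perm[p]] += sell_val[k] * x[sell_col[k]]
    return y


# Small 4x4 example matrix in CSR form (made up for this sketch).
row_ptr = [0, 1, 4, 5, 7]
col_idx = [0, 0, 1, 3, 2, 1, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

sell = csr_to_sell(row_ptr, col_idx, values, C=2, sigma=4)
print(sell_spmv(*sell, x=[1.0, 2.0, 3.0, 4.0], C=2))
```

    In this sketch C plays the role of the SIMD width and sigma controls how far rows may be moved by sorting; a larger sigma reduces padding at the cost of a less local row order.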

    Optional: Pattern-based performance engineering

    Optional: Energy consumption aspects of computing

  • Miscellaneous material

    Important links:

    LIKWID tool suite: https://github.com/RRZE-HPC/likwid

    LIKWID documentation Wiki: http://tiny.cc/LIKWID

    Kerncraft automatic Roofline/ECM modeling tool: https://github.com/RRZE-HPC/kerncraft

    GHOST sparse building blocks library: http://tiny.cc/GHOST

    PHIST, a Pipelined Hybrid Parallel Iterative Solver Toolkit: https://bitbucket.org/essex/phist


    STREAM source code: stream.c

    Compile with, e.g. (choose either never or always for -qopt-streaming-stores):

    icc -Ofast -xHost -qopenmp -fno-alias -nolib-inline -qopt-streaming-stores <never|always> -o stream.exe stream.c

    Run with:

    likwid-pin -c <pin_mask> ./stream.exe

    LIKWID-instrumented STREAM source code: stream-mapi.c

    Compile with, e.g.:

    icc <options-from-above> -DLIKWID_PERFMON -I<path_to_likwid_inc> stream-mapi.c -o stream-mapi.exe -L<path_to_likwid_lib> -llikwid

    Run with:

    likwid-perfctr -C <pin_mask> -m -g <perf_group> ./stream-mapi.exe

    Vector Triad throughput benchmark: triad-throughput.tar.gz

    Compile with:

    icc -c timing.c
    icc -c dummy.c
    ifort -Ofast -xHost -qopenmp -fno-alias -fno-inline triad-tp.f90 dummy.o timing.o -o triad.exe

    Run with:

    echo <size> | likwid-pin <PIN_OPTIONS> ./triad.exe


    Jacobi 3D stencil code: j3d_with_likwid.tar_.gz

    Build with the supplied Makefile (you may need to adapt it to your LIKWID setup).

    Run with:

    likwid-perfctr -C <pin_mask> -m -g <perf_group> ./J3D.exe <size>


    Sparse matrix benchmark code (CSR/SELL-C-sigma): sparsematrixbench.tar.gz