## Topic outline

### General

Even in scientific computing, code development often lacks a basic understanding of performance bottlenecks and relevant optimization opportunities. Textbook code transformations are applied blindly without a clear goal in mind. This course teaches a structured, model-based performance engineering approach at the compute node level. It aims at a deep understanding of how code performance comes about, which hardware bottlenecks apply, and how to work around them. The pivotal ingredient of this process is a model that links software requirements with hardware capabilities. Such models are often simple enough to be worked out with pencil and paper (such as the well-known Roofline model), but they lead to deep insights and strikingly accurate runtime predictions. The course starts with simple benchmark kernels and advances to various algorithms from computational science. Hands-on exercises using remote access to a cluster at the University of Erlangen will enable attendees to try the concepts themselves.

Lecturers:

Dr. Georg Hager & Prof. Dr. Gerhard Wellein

Friedrich-Alexander University of Erlangen-Nuremberg, Germany, and LIKWID High Performance Programming UG

Course dates: November 7-9, 2018, 8:30am-4:30pm

#### Course Schedule

**Introduction**

- Our approach to performance engineering
- Basic architecture of multicore systems: threads, cores, caches, sockets, memory
- The important role of system topology

**Tools: controlling code execution in multicore environments**

- Overview
- likwid-topology and likwid-pin
- likwid-setFrequencies

**Microbenchmarking for architectural exploration**

- Properties of data paths in the memory hierarchy
- Bottlenecks
- OpenMP barrier overhead

**Roofline model: basics**

- Model assumptions and construction
- Simple examples
- Limitations of the Roofline model
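
As a rough illustration of the pencil-and-paper character of the Roofline model, the minimal sketch below computes the attainable performance as the minimum of the peak compute rate and the product of memory bandwidth and computational intensity. The machine numbers and the function name `roofline` are hypothetical and chosen only for this example; they are not taken from the course material.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable performance in GFlop/s under the basic Roofline model.

    peak_gflops   -- peak compute performance in GFlop/s
    bandwidth_gbs -- memory bandwidth in GByte/s
    intensity     -- computational intensity in Flop/Byte
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine parameters (assumptions for illustration only).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Example kernel: vector triad a[i] = b[i] + c[i] * d[i] in double precision.
# Per iteration: 2 flops; 3 loads + 1 store + 1 write-allocate transfer
# = 5 words of 8 bytes = 40 bytes of memory traffic.
triad_intensity = 2 / 40

print(roofline(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity))
```

With these assumed numbers the kernel is memory bound: the bandwidth term is far below the peak compute rate, so the memory interface determines the predicted performance.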

**Tools: hardware performance counters**

- Why hardware performance counters?
- likwid-perfctr
- Validating performance models

**Optimal use of parallel resources**

- Single Instruction Multiple Data (SIMD)
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
- Simultaneous Multi-Threading (SMT)

**Roofline case studies**

- Stencil smoothers
- Sparse matrix-vector multiplication
- Tall and skinny dense matrix-vector multiplication

**The ECM performance model**

- Basics & simple examples
- Case study: A conjugate gradient solver

**Optional:**

- Energy consumption aspects of computing
- Introduction to the NEC SX-Aurora "Tsubasa" vector processor

### Day 1

Hands-on exercises for day 1:

### Day 2

Hands-on exercises for day 2:

### Day 3

Hands-on exercises for day 3:

### Miscellaneous material

LIKWID tool suite: https://github.com/RRZE-HPC/likwid

LIKWID documentation Wiki: http://tiny.cc/LIKWID

Kerncraft automatic Roofline/ECM modeling tool: https://github.com/RRZE-HPC/kerncraft

GHOST sparse building blocks library: http://tiny.cc/GHOST

PHIST, a Pipelined Hybrid Parallel Iterative Solver Toolkit: https://bitbucket.org/essex/phist