### General

Even in scientific computing, code is often developed without a basic understanding of performance bottlenecks and the optimization opportunities they imply. Textbook code transformations are applied blindly, without a clear goal in mind. This course teaches a structured, model-based performance engineering approach at the compute-node level. It aims at a deep understanding of how code performance comes about, which hardware bottlenecks apply, and how to work around them. The pivotal ingredient of this process is a model that links software requirements with hardware capabilities. Such models are often simple enough to be worked out with pencil and paper (the well-known Roofline model is one example), yet they lead to deep insights and strikingly accurate runtime predictions. The course starts with simple benchmark kernels and advances to various algorithms from computational science. Hands-on exercises using remote access to a cluster at the University of Erlangen will enable attendees to try out the concepts themselves.
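To give a flavor of what such a pencil-and-paper model looks like, here is a minimal sketch of the basic Roofline estimate: attainable performance is the smaller of the peak arithmetic throughput and the product of memory bandwidth and the code's computational intensity. The machine numbers (500 GFlop/s peak, 100 GB/s memory bandwidth) are hypothetical and chosen only for illustration, and the function name `roofline` is our own. The kernel is a vector triad `a[i] = b[i] + s * c[i]` in double precision, assuming a write-allocate transfer for `a`.

```python
def roofline(peak_gflops, mem_bw_gbs, intensity_flop_per_byte):
    """Basic Roofline estimate in GFlop/s.

    Performance is limited either by the core's peak throughput or by
    how fast the memory interface can deliver data for the computation.
    """
    return min(peak_gflops, intensity_flop_per_byte * mem_bw_gbs)


# Hypothetical machine parameters (illustrative assumptions, not measurements).
PEAK_GFLOPS = 500.0   # peak floating-point performance
MEM_BW_GBS = 100.0    # attainable memory bandwidth

# Vector triad a[i] = b[i] + s * c[i], double precision:
#   2 flops per iteration (one multiply, one add)
#   32 bytes per iteration: load b, load c, write-allocate a, store a
triad_intensity = 2 / 32  # flop/byte

triad_limit = roofline(PEAK_GFLOPS, MEM_BW_GBS, triad_intensity)
print(f"Triad Roofline limit: {triad_limit} GFlop/s")
```

With these assumed numbers the triad is limited by memory bandwidth, far below the peak. A kernel with a much higher computational intensity would instead hit the arithmetic peak.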

Lecturers:

Dr. Georg Hager & Prof. Dr. Gerhard Wellein

Friedrich-Alexander University of Erlangen-Nuremberg, Germany, and LIKWID High Performance Programming UG

Course dates: November 7-9, 2018, 8:30 am to 4:30 pm

#### Course Schedule

**Introduction**

- Our approach to performance engineering
- Basic architecture of multicore systems: threads, cores, caches, sockets, memory
- The important role of system topology

**Tools: controlling code execution in multicore environments**

- Overview
- likwid-topology and likwid-pin
- likwid-setFrequencies

**Microbenchmarking for architectural exploration**

- Properties of data paths in the memory hierarchy
- Bottlenecks
- OpenMP barrier overhead

**Roofline model: basics**

- Model assumptions and construction
- Simple examples
- Limitations of the Roofline model

**Tools: hardware performance counters**

- Why hardware performance counters?
- likwid-perfctr
- Validating performance models

**Optimal use of parallel resources**

- Single Instruction Multiple Data (SIMD)
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
- Simultaneous Multi-Threading (SMT)

**Roofline case studies**

- Stencil smoothers
- Sparse matrix-vector multiplication
- Tall and skinny dense matrix-vector multiplication

**The ECM performance model**

- Basics & simple examples
- Case study: A conjugate gradient solver

**Optional**

- Energy consumption aspects of computing
- Introduction to the NEC SX-Aurora "Tsubasa" vector processor