General “sequential” code optimization

Georg Hager (RRZE)
Reinhold Bader (RRZE)
General optimization: Outline

- “Common sense” optimizations
- Characterization of memory hierarchies
  - Bandwidth-based performance modeling
- Loop optimizations to reduce code balance
  - Fusion
  - Unrolling
  - blocking
- Profiling
- Some words about C++
“Common sense” optimizations: A Monte Carlo spin code
Optimization of a Spin System Simulation

- 3-D cubic lattice
- One variable ("spin") per grid point with values +1 or -1
- Next-neighbour interaction terms
- Code chooses spins randomly and flips them as required by MC algorithm
Optimization of a Spin System Simulation

- **Systems under consideration**
  - $50 \cdot 50 \cdot 50 = 125000$ lattice sites
  - $2^{125000}$ different configurations
  - Computer time: $2^{125000} \cdot 1 \text{ ns} \approx 10^{37000} \text{ years}$ – without MC 😊

- **Memory requirement of original program** $\approx 1$ MByte
Optimization of a Spin System Simulation:  
Original Code

- **Program Kernel:**

  - Load neighbors of a random spin
  - Calculate magnetic field
  - Decide about spin orientation

  ```
  IA = IZ(KL,KM,KN)
  IL = IZ(KLL,KM,KN)
  IR = IZ(KLR,KM,KN)
  IO = IZ(KL,KMO,KN)
  IU = IZ(KL,KMU,KN)
  IS = IZ(KL,KM,KNS)
  IN = IZ(KL,KM,KNN)

  edelz = iL + iR + iU + iO + iS + iN

  C   CRITERION FOR FLIPPING THE SPIN

  BF = 0.5d0 * (1.0 + tanh(edelz/tt))
  if (YHE.LE.BF) then
      iz(kl,km,kn) = 1
  else
      iz(kl,km,kn) = -1
  endif
  ```
Profiling shows that
• 30% of computing time is spent in the \texttt{tanh} function
• Rest is spent in the line calculating \texttt{edelz}

Why?
• \texttt{tanh} is expensive by itself
• Compiler fuses spin loads and calculation of \texttt{edelz} into a single line

What can we do?
• Try to reduce the “strength“ of calculations (here \texttt{tanh})
• Try to make the CPU move less data

How do we do it?
• Observation: argument of \texttt{tanh} is always integer in the range -6..6 (\texttt{tt} is always 1)
• Observation: Spin variables only hold values +1 or -1
Optimization of a Spin System Simulation: Making it Faster

- **Strength reduction by tabulation of tanh function**

\[
BF = 0.5d0 \times (1.d0 + \text{tanh\_table}(edelz))
\]

- Performance increases by 30% as table lookup is done with “lightspeed” compared to tanh calculation

- **By declaring spin variables with INTEGER*1 instead of INTEGER*4 the memory requirement is reduced to about \(\frac{1}{4}\)**
  - Better cache reuse
  - Factor 2–4 in performance depending on platform
  - Why don’t we use just one bit per spin?
    - Bit operations (mask, shift, add) too expensive → no benefit

- **Potential for a variety of data access optimizations**
  - But: choice of spin must be absolutely random!
Optimization of a Spin System Simulation:
Performance Results

- **Pentium 4 (2.4 GHz)**

![Bar chart showing runtime comparison between Original code, Table + 1Byte/Spin, and Table + 1Bit/Spin. The bar for Table + 1Bit/Spin is the longest, followed by Table + 1Byte/Spin, and the shortest is Original code.]

- Code optimization

(C) 2019 RRZE,LRZ
Optimization of data access
Data accesss:
Intel processors and the DRAM gap

- Deep pipeline → Fast clock
- SSE2

- Octo-core AVX
- Dual Core
- Quad Core
- 3-channel DDR3 on-chip, ccNUMA
- 18C, FMA
- 24C
- 28C
- 12C
- 6C

(C) 2019 RRZE,LRZ Code optimization
Using the memory hierarchy efficiently

- What conditions must an application code fulfill in order to make good use of the cache?

- **Cache as a high-speed data store:**
  Code must show good **temporal locality**, i.e. data items should be used multiple times before they get evicted from cache.

- **Cache line structure:**
  Code must show good **spatial locality**, i.e. data items just loaded must be near items to be loaded “soon”.

Implementing temporal and spatial locality by code transformations is the most powerful optimization strategy!

- **But:** What is “good enough?”
General remarks on algorithms and data access
Latency and bandwidth in modern computer environments

Avoiding slow data paths is the key to most performance optimizations!
Data access

- Data access is the most frequent performance-limiting factor in HPC
- Cache-based microprocessors feature small, fast caches and large, slow memory
  - “Memory Wall”, “DRAM Gap”
  - Latency can be hidden under certain conditions (prefetch, software pipelining)
  - Bandwidth limit cannot be circumvented
    - Instead, modify the code to avoid the slow data paths
- General guideline: examine “traffic-to-work” ratio (balance) of algorithm to get a hint at possible limitations
  - Examination of performance-critical loops is vital
  - Important metric: (“LOADs/STOREs to FLOPs”)
  - Optimization: lower LDST/FLOP ratio
- … and always remember that stride 1 access is best!
Kernel benchmarks:

- Characterize computer architecture (effective performance of processor & memory hierarchies)
- Results are easy to interpret and compare
- Provide upper performance bounds for “real” applications
- E.g.: STREAM, cachebench

Popular kernel benchmark: Vector Triad

```
REAL*8 (SIZE): A, B, C, D
DO ITER=1,NITER
  DO i = 1,N
    A(i) = B(i) + C(i) * D(i)
  ENDDO
<OBSCURE - use A,B,C,D>
ENDDO
```

Balance calculation

- Computation: 2 Flops / i-Iteration
- Data transfer: (3+1 LD & 1 ST) / i-Iteration
- Code balance = 2.5 W/F (20 bytes/flop)
Code balance ($B_c$) quantifies the requirements of a loop code:

$$B_c = \frac{\text{data transfer [bytes]}}{\text{arithmetic operations [flops]}}$$

Example: Vector triad $\mathbf{A}(:) = \mathbf{B}(:) + \mathbf{C}(:) \times \mathbf{D}(::)$

- $B_c = (4+1)$ Words / 2 Flops = 20 bytes/flop (including write allocate)

General rule: Reducing the code balance of a loop by optimizations will do something good for the performance!

- Often used: “Computational Intensity” $I = 1/B_c$

For refined analysis, code balance can also be defined for all memory hierarchy levels: $B_c^{\text{mem}}, B_c^{L3}, B_c^{L2}$

- Memory transfers are not always the data bottleneck!
Machine Balance

- For quick comparisons the concept of **machine balance** is useful

\[ B_m = \frac{b_s}{P_{\text{peak}}} \]

- **Machine Balance** = How much input data can be delivered for each FP operation? (“Memory Gap characterization”)
  - Assuming balanced MULT/ADD
  - Rough estimate: \( B_m \ll B_c \rightarrow \) strongly memory-bound code

- **Typical values (main memory):**

  Intel Haswell 14-core 2.3 GHz
  \[ B_m = 60 \text{ GB/s} / (14 \times 2.3 \times 16) \text{ GF/s} \approx 0.12 \text{ B/F} \]
  Intel Sandy Bridge 8-core 2.7 GHz \( \approx 0.23 \text{ B/F} \)
  Nvidia K20x \( \approx 0.14 \text{ B/F} \)
  Intel Xeon Phi Knights Corner \( \approx 0.16 \text{ B/F} \)
Evolution of machine balance

Machine balance $B^m_{mem}$ [byte/flop]

- NHL: Nehalem EP
- SNB: Sandy Bridge EP
- HSW: Haswell EP
- BDW: Broadwell EP
- SKL: Skylake SP
- KNC: Knights Corner
- KNL: Knights Landing

Year:
- 1994
- 1996
- 1998
- 2000
- 2002
- 2004
- 2006
- 2008
- 2010
- 2012
- 2014
- 2016
- 2018

Single core vs. Multicore

- NEC SX-8 (4 B/F)
- NEC SX-9 (2.56 B/F)
- NEC SX-ACE
- NEC Tsubasa
- KNC
- KNL (MCDRAM)
- KNL (main mem.)
- K20
- P100
- V100
- HSW
- BDW
- SKL

(C) 2019 RRZE,LRZ
Estimating loop performance

- “Roofline model”
- Calculate

\[ l = \min \left( 1, \frac{B_m}{B_c} \right) \]

This is the fraction of peak performance that the loop can achieve

- Multiply by \( P_{\text{peak}} \):

\[ P = \min \left( P_{\text{peak}}, \frac{b_s}{B_c} \right) \]

- This is a simple model to get an upper limit for loop performance
  - \( I > B_m^{-1} \) \( \rightarrow \) core-bound code
  - \( I < B_m^{-1} \) \( \rightarrow \) memory-bound code
for (i=0; i<N; ++i) {
    a[i] = a[i]+b[i];
}

\[B_C = 3 \text{ WORDS} / 1 \text{ FLOP} = 24 \text{ B/F}\]

Scalar – can be kept in register

for (i=0; i<N; ++i) {
    a[i] = a[i]+s*b[i];
}

\[B_C = 3 \text{ WORDS} / 2 \text{ FLOP} = 12 \text{ B/F}\]

Scalar – can be kept in register

s=0.0;
for (i=0; i<N; ++i) {
    s=s+a[i]*a[i];
}

\[B_C = 1 \text{ WORDS} / 2 \text{ FLOP} = 4 \text{ B/F}\]

Scalar – can be kept in register

s=0.0;
for (i=0; i<N; ++i) {
    s=s+a[i]*b[i];
}

\[B_C = 2 \text{ WORDS} / 2 \text{ FLOP} = 8 \text{ B/F}\]

Scalar – can be kept in register
Case study: OpenMP-parallel sparse matrix-vector multiplication

A simple (but sometimes not-so-simple) example for bandwidth-bound code and saturation effects in memory
Sparse matrix-vector multiply (spMVM)

- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson

- **Store only $N_{nz}$ nonzero elements** of matrix and RHS, LHS vectors with $N_r$ (number of matrix rows) entries

- "Sparse": $N_{nz} \sim N_r$

---

General case: some indirect addressing required!
**CRS matrix storage scheme**

- **val[]** stores all the nonzeros (length $N_{nz}$)
- **col_idx[]** stores the column index of each nonzero (length $N_{nz}$)
- **row_ptr[]** stores the starting index of each new row in **val[]** (length: $N_r$)
Case study: Sparse matrix-vector multiply

- **Strongly memory-bound for large data sets**
  - Streaming, with partially indirect access:

```c
!$OMP parallel do
do i = 1,N_r
   do j = row_ptr(i), row_ptr(i+1) - 1
      c(i) = c(i) + val(j) * b(col_idx(j))
   enddo
endo
do i = 1,N_r
   do j = row_ptr(i), row_ptr(i+1) - 1
      c(i) = c(i) + val(j) * b(col_idx(j))
   enddo
endo
!$OMP end parallel do
```

- Usually many spMVMs required to solve a problem

- **Following slides: Performance data on one 24-core AMD Magny Cours node**
Application: Sparse matrix-vector multiply

Strong scaling on one XE6 Magny-Cours node

Case 1: Large matrix

cnt, 62451x62451, non-zero: 4007383

Intrasocket bandwidth bottleneck

Good scaling across NUMA domains
Application: Sparse matrix-vector multiply
Strong scaling on one XE6 Magny-Cours node

Case 1: Large matrix

6 B/F model @ 12 Gbyte/s
Application: Sparse matrix-vector multiply

Strong scaling on one XE6 Magny-Cours node

Case 2: Medium size

Intrasocket bandwidth bottleneck

Working set fits in aggregate cache

mc2depi, 525825x525825, non-zero: 2100225
Application: Sparse matrix-vector multiply
Strong scaling on one Magny-Cours node

Case 3: Small size

No bandwidth bottleneck

Parallelization overhead dominates

(C) 2019 RRZE,LRZ
Conclusions from the spMVM benchmarks

- If the problem is “large”, bandwidth saturation on the socket is a reality
  - → There are “spare cores”
  - Very common performance pattern

- What to do with spare cores?
  - Let them idle (saves energy) or use them for other tasks, such as MPI communication

- Can we predict the saturated performance?
  - Balance-based performance modeling!

- Can we predict the saturation point?
  - … and why is this important? → advanced! (ECM Model)
Code optimization by data access optimization
Case 1: \( O(N)/O(N) \) Algorithms

- \( O(N) \) arithmetic operations vs. \( O(N) \) data access operations
- Examples: Scalar product, vector addition, sparse MVM etc.
- Performance limited by memory BW for large \( N \) (“memory bound”)
- Limited optimization potential for single loops
  - ...at most a constant factor for multi-loop operations
- Example: successive vector additions

\[
\begin{align*}
\text{do } & i=1,N \\
& a(i)=b(i)+c(i) \\
\text{enddo} \\
\text{do } & i=1,N \\
& z(i)=b(i)+e(i) \\
\text{enddo}
\end{align*}
\]

Loop fusion

\[
\begin{align*}
\text{do } & i=1,N \\
& a(i)=b(i)+c(i) \\
& z(i)=b(i)+e(i) \\
\text{enddo}
\end{align*}
\]

\( B_c = (3+1)/1 \) W/F = 32 B/F

\( B_c = 7/2 \) W/F = 28 B/F

Fusing different loops allows \( O(N) \) data reuse from registers

No optimization potential for either loop
Data access – general guidelines

- **Case 2: O(N²)/O(N²) algorithms**
  - Examples: dense matrix-vector multiply, matrix addition, dense matrix transposition etc.
  - Nested loops
  - Memory bound for large N
  - Some optimization potential (at most constant factor)
  - Can often enhance code balance by outer loop unrolling
  - Example: dense matrix-vector multiplication

```plaintext
\[
\begin{align*}
\text{do } i=1,N \\
\hspace{1em} \text{do } j=1,N \\
\hspace{2em} c(i) &= c(i) + a(j,i) \cdot b(j) \\
\text{enddo} \\
\text{enddo}
\end{align*}
\]

Naïve version loads \( b[] \) \( N \) times!
Data access – general guidelines

- **O(N²)/O(N²) algorithms cont’d**
  - “Unroll & jam” optimization (or “outer loop unrolling”)

```fortran
! Unroll & jam optimization (or "outer loop unrolling")
!
! do i=1,N
!   do j=1,N
!     c(i)=c(i)+a(j,i)*b(j)
!   enddo
! enddo
!
! do i=1,N,2
!   do j=1,N
!     c(i) =c(i) +a(j,i) *b(j)
!   enddo
! enddo
! do j=1,N
!   c(i+1)=c(i+1)+a(j,i+1)*b(j)
! enddo
! enddo
```

- **b(j) can be re-used once from register → save 1 data transfer**

**Lowers B_c from 8 to 6 B/F**
Data access – general guidelines

- **O(N²)/O(N²) algorithms cont’d**
  - Data access pattern for 2-way unrolled dense MVM:
    - Code balance can still be enhanced by more aggressive unrolling (i.e., m-way instead of 2-way)
    - Significant code bloat (try to use compiler directives if possible)
      - Ultimate limit: \( \mathbf{b}[\cdot] \) only loaded once from memory \( (B_c \approx \frac{1}{2} W/F) \)
      - Beware: CPU registers are a limited resource
      - Excessive unrolling can cause register spills
    - Large cache \( \rightarrow \mathbf{b}[\cdot] \) may be in cache even without unrolling!
Case 3: $O(N^3)/O(N^2)$ algorithms

- Most favorable case – computation outweighs data traffic by factor of $N$
- Examples: Dense matrix diagonalization, dense matrix-matrix multiplication
- Huge optimization potential: proper optimization can render the problem cache-bound if $N$ is large enough
- Example: dense matrix-matrix multiplication

Core task: dense MVM ($O(N^2)/O(N^2)$)

→ memory bound, but there is another loop level!
Case Study: The Jacobi Smoother

The basics in two dimensions
Roofline performance analysis and modeling
Jacobi Iteration in 2D for the heat equation

**Basics**

- Discretization of partial differential equation $\Delta \Phi = 0$
- Discretize equation on a finite set of equidistant mesh points
- Solve with Dirichlet boundary conditions using Jacobi iteration scheme

2D stencil update

$$B_{i,k} = \frac{1}{4} \cdot (A_{i-1,k} + A_{i+1,k} + A_{i,k+1} + A_{i,k-1})$$

Naive analysis: 4 FLOPs, 4 LD, 1 ST
A Jacobi smoother

- Laplace equation in 2D: \[ \Delta \Phi = 0 \]

- Solve with Dirichlet boundary conditions using Jacobi iteration scheme:

```plaintext
double precision, dimension(0:imax+1,0:kmax+1,0:1) :: phi
integer :: t0,t1
T0 = 0 ; t1 = 1
DO IT = 1,ITMAX ! choose suitable number of sweeps
  DO K = 1,KMAX
    DO I = 1,IMAX
      ! four flops, one store, four loads
      PHI(I,K,T1) = ( PHI(I+1,K,T0) + PHI(I-1,K,T0) \\
                     + PHI(I,K+1,T0) + PHI(I,K-1,T0) ) * 0.25
    END DO
  END DO
  ! swap arrays
  I = I0 ; T0 = T1 ; T1 = I
END DO
WRITE ALLOCATE: RD + WR PHI(I,K,T1)
```

Reuse when computing \( \phi(i+2,k,t1) \)

Naive balance (incl. write allocate):

\[ \text{phi}(:, :, t0): 3 \text{ reads} + \text{phi}(:, :, t1): 1 \text{ write} + 1 \text{ write-allocate} \]

\[ B_C = 5 \text{ W} / 4 \text{ FLOPs} = 10 \text{ B/F} \]
Performance metrics: 2D stencil

- **Alternative implementation ("Macho FLOP version")**

  ```
  do k = 1, kmax
    do i = 1, imax
      phi(i, k, t1) = 0.25 * phi(i+1, k, t0) + 0.25 * phi(i-1, k, t0)
      + 0.25 * phi(i, k+1, t0) + 0.25 * phi(i, k-1, t0)
    enddo
  enddo
  ```

- MFlops/sec increases by 7/4 but time to solution remains the same

- Better metric (for many iterative stencil schemes):
  Lattice Site Updates per Second (LUPs/sec)

2D Jacobi example: Compute LUPs/sec metric via

\[
P[LUPs \, / \, s] = \frac{it_{\text{max}} \cdot im_{\text{max}} \cdot k_{\text{max}}}{T_{\text{wall}}}\]
Balance metric: 2 D stencil and the cache

If cache is too small to hold three rows:

\[
\begin{align*}
\phi(:,:,t0): & \text{ 2 reads} + \\
\phi(:,:,t1): & \text{ 1 write + 1 write-allocate}
\end{align*}
\]

\[\Rightarrow B_C = \frac{5 \text{ W}}{4 \text{ F}} = 40 \text{ B/LUP} \]

If cache is large enough to hold at least 3 rows (shaded region): Each \( \phi(:,:,t0) \) is loaded once from main memory and re-used 3 times from cache:

\[
\begin{align*}
\phi(:,:,t0): & \text{ 1 read} + \\
\phi(:,:,t1): & \text{ 1 write + 1 write-allocate}
\end{align*}
\]

\[\Rightarrow B_C = \frac{3 \text{ W}}{4 \text{ F}} = 24 \text{ B/LUP} \]
2D \rightarrow 3D

### 3D sweep:

\[
\begin{align*}
\text{do } & \text{k=1,kmax} \\
\text{do } & \text{j=1,jmax} \\
\text{do } & \text{i=1,imax} \\
\phi(i,j,k,t1) & = 1/6 \cdot (\phi(i-1,j,k,t0)+\phi(i+1,j,k,t0) \& \phi(i,j-1,k,t0)+\phi(i,j+1,k,t0) \& + \phi(i,j,k-1,t0)+\phi(i,j,k+1,t0))
\end{align*}
\]

\[
\text{enddo}
\]

\[
\text{enddo}
\]

\[
\text{enddo}
\]

### Best case balance: 1 read

1 write + 1 write alloc.

6 flops = 1 LUP

\[\rightarrow B_C = 0.5 \text{ W/F (24 bytes/LUP)}\]

### No 3-layer condition but 3 rows fit:

\[B_C = 5/6 \text{ W/F (40 bytes/LUP)}\]

### Worst case (3 rows do not fit):

\[B_C = 7/6 \text{ W/F (56 bytes/LUP)}\]
3D Jacobi solver

**Performance of vanilla code on one Interlagos chip (8 cores)**

- **Problem size: \( N^3 \)**

- **Cache**

- **Memory**

- **3 layers of source array drop out of L2 cache**

- **Roofline inappropriate for unsaturated case**

- **T=1**
- **T=2**
- **T=8**
- **Performance model (mem.)**

- **24B/update**
- **40B/update**

(C) 2019 RRZE, LRZ
Conclusions from the Jacobi example

- We have **made sense** of the memory-bound **performance** vs. **problem size**
  - “Layer conditions” lead to **predictions of code balance**
  - Achievable memory bandwidth is input parameter

- The model works only if the **bandwidth is “saturated”**
  - In-cache modeling is more involved

- **Optimization == reducing the code balance by code transformations**
  - See below
Data access optimizations for the Jacobi smoother
Remember the 3D Jacobi solver on Interlagos?

- 3 layers of source array drop out of L2 cache
- → avoid through spatial blocking!
Jacobi iteration (2D): Spatial Blocking

Layer condition broken because $i_{\text{max}}$ too large

Idea: Rearrange updates so inner dimension becomes smaller; update blocks in turn

Boundary overhead: Often insignificant, but keep an eye on it!
Jacobi iteration (3D): Spatial blocking

- **Implementation:**
  ```fortran
  do ioffset=1,imax,iblock
    do joffset=1,jmax,jblock
      do k=1,kmax
        do j=joffset, min(jmax,joffset+jblock-1)
          do i=ioffset, min(imax,ioffset+iblock-1)
            phi(i,j,k,t1) = ( phi(i-1,j,k,t0)+phi(i+1,j,k,t0)
                             + ... + phi(i,j,k-1,t0)+phi(i,j,k+1,t0) )/6.d0
          enddo
        enddo
      enddo
    enddo
  enddo
  ```

  **Loop over i-blocks**

  **Loop over j-blocks**

- **Guidelines:**
  - If possible, keep inner block size (loop length) long
  - **Block sizes** (`iblock`, `jblock`) must be small enough to fulfill “layer condition”
  - Cache size is a hard limit!
  - Blocking loops may have some impact on memory page placement

- **Layer condition:**
  \[ 3 \cdot iblock \cdot jblock \cdot 8 \text{ bytes} \leq \text{CACHE\_SIZE\_PER\_THREAD}/2 \]
Profiling
Key question

- How do I know where my code spends most of its time?
  - This is called “profiling”
  - Many (free and commercial) tools exist
  - C++ code is notoriously hard to profile
    - Overloaded operators, tiny methods
Profiling with gprof

- Basic profiling tool under Linux: \texttt{gprof}

- Compiling for a profiling run

  \texttt{icpc -pg} ......

- After running the binary, a file \texttt{gmon.out} is written to the current directory

- Human readable output:

  \texttt{gprof a.out}

- Inlining should be disabled for profiling
  
  * But then the executed code isn’t what it should be…*
Example with wrapped `double` class:

```cpp
class D {
  double d;
public:
  D(double _d=0) : d(_d) {}
  D operator+(const D& o) {
    D r;
    r.d = d+o.d;
    return r;
  }
  operator double() {
    return d;
  }
};
```

Main program:

```cpp
const int n=10000000;
D a[n],b[n];
D sum;
for(int i=0; i<n; ++i) a[i] = b[i] = 1.5;
double s = timestamp();
for(int k=0; k<10; ++k) {
  for(int i=0; i<n; ++i)
    sum = sum + a[i] + b[i];
}
```
Profiling with gprof
Example profiler output

- `icpc -O3 -pg perf.cc`

% cumulative  self    self    total
time  seconds  seconds  calls  Ts/call  Ts/call  name
101.01  0.41    0.41

- `icpc -O3 -fno-inline -pg perf.cc`

% cumulative  self    self    total
time  seconds  seconds  calls  ns/call  ns/call  name
46.44  0.59    0.59  200000000  2.93   4.48  D::operator+(D const&)
29.63  0.96    0.37  240000001  1.56   1.56  D::D(double)
24.82  1.27    0.31

- But where did the time *actually* go?
  - Butterfly (callgraph) profile also available
  - Real problem also with use of libraries (STL!)
  - Sometimes you have to roll your own little profiler (good idea!)
Manual profiling with a timer class

- Measuring walltime on UNIX (-like) systems
  - Stay away from CPU time – it’s evil!
  - Use `gettimeofday()` to measure walltime:

```c
#include <sys/time.h>

double timestamp(void){
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return((double)(tp.tv_sec + tp.tv_usec/1000000.0));
}
```

- Code available in the exercise templates
Consequences from the saturation pattern for profiling

- Clearly distinguish between "saturating" and "scalable" performance on the chip level.
Consequences from the saturation pattern for profiling

- There is **no clear bottleneck for single-core** execution
- Code profile for single thread ≠ code profile for multiple threads
  - → Single-threaded profiling may be misleading
Some words about efficient C++
Writing efficient C++ code: Key questions

- **C++ code is notoriously hard to profile**
  - Overloaded operators, tiny methods

- **What are the most common performance problems with C++?**
  - Too much trust in the compiler’s ability to optimize code
  - Too many useless constructions and destructions
  - Too many temporary objects
  - Overloaded operator overuse
C++ performance pitfalls in a nutshell

- **Constructing and destroying objects**
  - Be aware that derived class ctors/dtors call their base class variants “recursively”
  - Try to initialize member objects right in the initialization list rather than the body:
    
    ```cpp
    X::X(point _p) : p(_p) {}
    ```
    
    is better than
    
    ```cpp
    X::X(point _p) { p = _p; }
    ```
  - Do not instantiate objects you don’t need:
    
    ```cpp
    if(a<b) {
    X x(a); // better here than outside
    x.someComplexStuff();
    }
    ```
C++ performance pitfalls in a nutshell

- **Virtual functions**
  - Virtual functions cannot be inlined
  - If there are many small, often-called virtual functions this may have a performance impact

- **Temporary objects**
  - Operator overloading tends to generate many temporaries

```cpp
a = b + c + d; // 2 temp objects

a = b; // no temp objects
a += c;
a += d;
```
C++ performance pitfalls in a nutshell

- **Unnecessary copies**
  - Returning an object and passing an object to a function require copies
    - Copying standard data types is most often harmless
  - Try to work with references/pointers where feasible
  - Use **reference counting** if copies cannot be avoided

- **Do not trust the compiler to do “the right thing”**
  - `vector<>::operator[]` is usually slower than plain array access
    - SIMD vectorization is main issue
  - Use **iterators** where possible
  - Tell the compiler that arrays do not overlap: `-fno-alias` and friends
Further thoughts…

- “But Expression Templates will do the trick, right?”


- “But if I write `inline` in front of all methods/functions, the compiler will optimize perfectly, right?”

No. Inlining will only reduce the most severe overheads. In fact, putting tight, performance-critical loops into separate C compilation units will often lead to better performance. ("Line Update Kernel" principle)

- Bottom line: If the code behaves in accordance with your performance model, you’re done.
The Golden Rule of Performance

Performance \cdot Flexibility = \text{const.}
Summary of scalar optimizations

- **#1 optimization in HPC:** *Avoid slow data paths!*
  - #0 optimization: reduce the amount of work…
  - #2 optimization: speed up in-core code (pipelining, SIMD, …)

- **Machine/code balance** *performance model* gives estimate for performance of “data streaming” loops

- **Typical balance-reducing optimizations** (avoiding slow data paths)
  - Loop fusion
  - Outer loop unrolling (unroll-and-jam)
  - Loop blocking

- Be aware of *saturation patterns* on the chip level!
- **Estimate potential gain** before embarking on complex code transformations!
- **The compiler will NOT fix it** for you (mostly)!
More on Performance Engineering

- **Upcoming tutorials on “Node-Level Performance Engineering”**
  - June 27-28, 2019 @ HLRS Stuttgart
  - December XX, 2019 @ TU Wien
  - Some time in spring 2020 @ LRZ Garching

**Abstract:**

This course teaches performance engineering approaches on the compute node level. “Performance engineering” as we define it is more than employing tools to identify hotspots and bottlenecks. It is about developing a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code gets executed that does the actual computational work. Once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of optimizations can often be predicted. We introduce a “holistic” node-level performance engineering strategy centered around the roofline performance model and apply it to different algorithms from computational science. We also show that simple, easy to use tools bring us a long way towards deep insight into the interaction between software and hardware.