Hands-On: counting memory bandwidth and traffic

Task: Explore the behavior of a memory benchmark using likwid-perfctr

In this exercise you will analyze and predict the data access patterns of typical streaming kernels and validate your predictions with `likwid-perfctr` measurements.

Preparation

You can find the benchmark code in the BWBENCH folder of the teacher account. Copy it again, since there may have been updates.

Compile benchmark

Compile a threaded OpenMP binary with optimization flags:

$ icc -Ofast -xHost -fno-alias -std=c99 -qopenmp -o bwBench bwBench.c

or

$ ifort -Ofast -xHost -fno-alias -qopenmp -o bwBench bwBench.f90

Investigate the benchmark code

Analyze the `bwBench` source code and derive the ratio of loads to stores for every benchmark case.

Take possible write-allocate transfers into account!

Run benchmark

Data traffic analysis

Execute the benchmark using all cores on one socket:

$ likwid-pin -c S0:0-23 ./bwBench

Questions:

  1. Why is the sustained bandwidth different for the different benchmark cases?
  2. Which correction factor must be applied to each reported number to obtain the actual bandwidth?
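This derivation rests on an assumption that you should check in the source: that `bwBench`, like STREAM-style benchmarks usually do, counts only the bytes each loop explicitly loads and stores. If so, the correction factor f is the ratio of actual to counted traffic per iteration. In the formulas below, N_L is the number of load streams, N_S the number of store streams and N_WA the number of write-allocate streams. The loop bodies are hypothetical examples:

```latex
\begin{aligned}
f &= \frac{B_{\mathrm{actual}}}{B_{\mathrm{reported}}}
   = \frac{N_L + N_S + N_{WA}}{N_L + N_S} \\
f_{\mathrm{copy}}\;(a_i = b_i) &= \frac{1 + 1 + 1}{1 + 1} = 1.5 \\
f_{\mathrm{init}}\;(a_i = s) &= \frac{0 + 1 + 1}{0 + 1} = 2 \\
f_{\mathrm{triad}}\;(a_i = b_i + s\,c_i) &= \frac{2 + 1 + 1}{2 + 1} = \tfrac{4}{3} \\
f_{\mathrm{update}}\;(a_i = s\,a_i) &= \frac{1 + 1 + 0}{1 + 1} = 1
\end{aligned}
```

Multiplying a reported bandwidth by f estimates the bandwidth actually drawn from memory. If the compiler generates nontemporal (streaming) stores, no write-allocate occurs. In that case N_WA = 0 and f = 1 for every kernel.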


Task: Measure the real data traffic using `likwid-perfctr`.

Optional (if time permits): Instrument the binary yourself using the LIKWID Marker API, or use the provided `bwBench-likwid.{c,f90}`.

Compile the code with:

$ icc -Ofast -xHost -fno-alias -std=c99 -qopenmp -DLIKWID_PERFMON \
       ${LIKWID_INC} -o bwBench-perf bwBench-likwid.c ${LIKWID_LIB} -llikwid

or

$ ifort -Ofast -xHost -fno-alias -qopenmp \
       ${LIKWID_INC} -o bwBench-perf bwBench-likwid.f90 ${LIKWID_LIB} -llikwid

Execute measurement with:

$ likwid-perfctr -g MEM_DP -C S0:0-9 -m ./bwBench-perf

Look at the following derived metrics:

  • Memory read data volume
  • Memory write data volume
  • Memory data volume
  • Memory bandwidth

Questions:

  1. Do the measured load-to-store ratios match your previous analysis?
  2. How do the measured bandwidths compare to those reported by the benchmark? (Hint: With active markers, the benchmark reports lower bandwidths because of the instrumentation overhead. Always compare with the run without markers!)

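If the "copy" case shows unexpected results, recompile with the flag `-ffreestanding`. One possible cause is that the compiler replaces the copy loop with a call to an optimized `memcpy` routine.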


ccNUMA bandwidth scalability (to be done after the ccNUMA lecture)

Execute the benchmark using all cores on one socket:

$ likwid-pin -c S0:0-23 ./bwBench

Execute the benchmark using all cores on two sockets:

$ likwid-pin -c N:0-47 ./bwBench

Does the bandwidth scale?

Optionally, you may use `likwid-perfctr` to verify how the total data volume is distributed across the different memory interfaces.



Last modified: Thursday, 11 March 2021, 8:08 AM