Exercise 2: Dense matrix-vector multiplication

Preparation

Get the source files XX from YY. 

Get an interactive job on the Emmy cluster with:

$ qsub -I -l nodes=1:ppn=40:f2.2:likwid  -l walltime=08:00:00

Set up the environment:

$ module load intel64/19.0up02
$ module load likwid/4.3.4

Build the executable with:

$ icc -Ofast -xhost -std=c99 -D_GNU_SOURCE   -o ./dmvm ./dmvm.c

Test whether it works:

$ likwid-pin -c S0:2 ./dmvm  0 5000

The helper script ./bench.pl scans the data set size. Use it as follows:

$ ./bench.pl <N columns>

Generate a PNG plot of the result with gnuplot:

$ gnuplot bench.plot

The plot script expects the benchmark output in bench.dat!

Performance Engineering cycle

  1. Code Analysis
  2. Develop performance expectation
  3. Benchmarking
  4. Performance profiling
  5. Optimization
  6. Validation

Benchmarking

What do we expect based on the static code analysis? What does this mean for benchmark planning?
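As a starting point for the static analysis, here is a minimal sketch of a dense matrix-vector kernel. The exercise does not reproduce dmvm.c, so the column-major layout, the loop order and the name dmvm_naive are assumptions made for illustration only. The comments count flops and data traffic per inner iteration for this assumed form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y = y + A * x, A stored column-major
 * (element (r, c) at a[c * n_rows + r]). Not taken from dmvm.c. */
void dmvm_naive(size_t n_rows, size_t n_cols,
                const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);

    for (size_t c = 0; c < n_cols; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            /* Per iteration: 2 flops (1 multiply, 1 add).
             * a[...] : 8 bytes loaded, streamed once, never reused.
             * y[r]   : 8 bytes loaded + 8 bytes stored, reused for every c.
             * x[c]   : constant in the inner loop, stays in a register.
             * If y fits in a cache, memory traffic is 8 bytes per 2 flops;
             * if not, y also comes from memory: 24 bytes per 2 flops. */
            y[r] += a[c * n_rows + r] * x[c];
        }
    }
}
```

Under these assumptions the number of rows decides whether y stays in cache, which is one reason to scan the row count in the benchmark.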

Set the number of columns to 10000 and scan the number of rows with (this should take less than 2 minutes):

$ ./bench.pl 10000 > bench.dat

What do we learn from the result? Is this what we expected? How can we measure what is going on?

Performance profiling

Instrument the source code with the Likwid marker API.
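A minimal sketch of what the instrumentation could look like. The LIKWID_MARKER_* macros come from likwid.h and are only active when the code is compiled with -DLIKWID_PERFMON. The kernel form, the function names and the region name "dmvm" are assumptions for illustration and are not taken from dmvm.c.

```c
#include <assert.h>
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
/* No-op fallbacks so the same source builds without -DLIKWID_PERFMON. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical kernel (column-major A) wrapped in a named marker region. */
static void dmvm_marked(size_t n_rows, size_t n_cols,
                        const double *a, const double *x, double *y)
{
    LIKWID_MARKER_START("dmvm");
    for (size_t c = 0; c < n_cols; c++)
        for (size_t r = 0; r < n_rows; r++)
            y[r] += a[c * n_rows + r] * x[c];
    LIKWID_MARKER_STOP("dmvm");
}

/* Stand-in for main(): initialise the marker API once, run the measured
 * region, then close the API so the results are written out. */
void run_profiled(size_t n_rows, size_t n_cols,
                  const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);

    LIKWID_MARKER_INIT;
    LIKWID_MARKER_THREADINIT;
    dmvm_marked(n_rows, n_cols, a, x, y);
    LIKWID_MARKER_CLOSE;
}
```

Only the code between START and STOP is measured, so keep allocation and initialisation outside the region.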

Build the new version with:

$ icc -Ofast -xhost -std=c99 -D_GNU_SOURCE -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid

Test your new version using:

$ likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000

Repeat the row-count scan with the following command:

$ ./bench-perf.pl 10000 MEM

What is the result? Repeat for the L3 and L2 groups.

What do we learn from profiling?

Optimization and Validation

What can we do about the performance drops?

Plan and implement an optimization called spatial cache blocking. Make the target cache level configurable.
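One possible shape of a blocked kernel is sketched below. It assumes a column-major matrix and a kernel of the form y[r] += a[c * n_rows + r] * x[c], which is not confirmed by the exercise text. The names dmvm_blocked and block_rows_for_cache, and the rule of thumb of filling about half of the target cache with the y block, are illustrative choices, not the reference solution.

```c
#include <assert.h>
#include <stddef.h>

/* Rows per block so that the y block takes about half of the target cache.
 * The factor 1/2 is a heuristic assumption, to be tuned by measurement. */
size_t block_rows_for_cache(size_t cache_bytes)
{
    size_t rows = cache_bytes / 2 / sizeof(double);
    return rows > 0 ? rows : 1;
}

/* Hypothetical blocked kernel: process the rows in chunks of block_rows so
 * that the active part of y stays in cache while all columns are traversed. */
void dmvm_blocked(size_t n_rows, size_t n_cols, size_t block_rows,
                  const double *a, const double *x, double *y)
{
    assert(block_rows > 0);

    for (size_t rb = 0; rb < n_rows; rb += block_rows) {
        size_t r_end = rb + block_rows < n_rows ? rb + block_rows : n_rows;

        for (size_t c = 0; c < n_cols; c++)
            for (size_t r = rb; r < r_end; r++)
                y[r] += a[c * n_rows + r] * x[c];
    }
}
```

To block for a different cache level, pass that cache's size in bytes to block_rows_for_cache, for example 32 * 1024 for a 32 KiB L1 cache.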

Repeat the benchmarking only (without the -DLIKWID_PERFMON define), then validate the results with profiling.

Going parallel

Parallelize both the initial and the optimised version with OpenMP. Take care of the reduction on y!
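The reduction issue can be illustrated with a sketch like the following. It assumes the column loop is the one being distributed, so several threads would otherwise update the same y[r] concurrently. Each thread accumulates into a private copy of y and merges it at the end. The kernel form and the name dmvm_omp are assumptions; other strategies (for example distributing the rows instead) avoid the reduction altogether.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical OpenMP variant: columns are split across threads, each thread
 * sums into its own y_priv, and the private results are merged under a
 * critical section. Without OpenMP the pragmas are ignored and the code
 * runs serially with the same result. */
void dmvm_omp(size_t n_rows, size_t n_cols,
              const double *a, const double *x, double *y)
{
    assert(n_rows > 0);

    #pragma omp parallel
    {
        double *y_priv = calloc(n_rows, sizeof *y_priv);
        assert(y_priv != NULL);

        #pragma omp for schedule(static) nowait
        for (long c = 0; c < (long)n_cols; c++)
            for (size_t r = 0; r < n_rows; r++)
                y_priv[r] += a[(size_t)c * n_rows + r] * x[c];

        /* Merge step: only one thread at a time may update the shared y. */
        #pragma omp critical
        for (size_t r = 0; r < n_rows; r++)
            y[r] += y_priv[r];

        free(y_priv);
    }
}
```

The merge step costs extra traffic on y per thread, which is worth keeping in mind when interpreting the scaling results.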

Benchmark the results and scale out within one socket. What are the results?

Last modified: Tuesday, 3 September 2019, 8:20 AM