Exercise 2: Dense matrix-vector multiplication
Get the source files XX from YY.
Get an interactive job on the Emmy cluster with:
$ qsub -I -l nodes=1:ppn=40:f2.2:likwid -l walltime=08:00:00
Set up the environment:
$ module load intel64/19.0up02
$ module load likwid/4.3.4
Build the executable with:
$ icc -Ofast -xhost -std=c99 -D_GNU_SOURCE -o ./dmvm ./dmvm.c
Test if it is working:
$ likwid-pin -c S0:2 ./dmvm 0 5000
There is a helper script ./bench.pl that allows you to scan the data set size. Use it as follows:
$ ./bench.pl <N columns>
You can generate a PNG plot of the result with gnuplot:
$ gnuplot bench.plot
The plot script expects the benchmark output in bench.dat!
Performance Engineering cycle
- Code Analysis
- Develop performance expectation
- Performance profiling
What do we expect based on the static code analysis? What does this mean for benchmark planning?
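As a starting point for the static analysis, a straightforward dmvm kernel might look like the following sketch (the loop order, names, and data layout in the actual dmvm.c may differ). The comments summarize the code-balance reasoning that drives the benchmark planning:

```c
#include <stddef.h>

/* Sketch of a plain dense matrix-vector multiply y += A*x with A stored
 * column by column (an assumption; check dmvm.c). Per inner iteration:
 * 2 flops (multiply + add) and one 8-byte load from a[] that cannot be
 * cached across iterations. If x[] and y[] stay in cache, the code
 * balance is 8 B / 2 flops = 4 B/flop, so for large data sets we expect
 * performance to be limited by memory bandwidth. Since y[] is traversed
 * in the inner loop, performance should drop whenever 8*rows bytes
 * exceed a cache level -- which is why scanning the row count at fixed
 * column count is the interesting experiment. */
void dmvm(int rows, int cols, const double *restrict a,
          const double *restrict x, double *restrict y)
{
    for (int c = 0; c < cols; ++c)       /* outer loop over columns */
        for (int r = 0; r < rows; ++r)   /* inner loop streams y[]  */
            y[r] += a[(size_t)c * rows + r] * x[c];
}
```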
Set the number of columns to 10000 and scan the number of rows (this should take less than two minutes):
$ ./bench.pl 10000 > bench.dat
What do we learn from the result? Is this what we expected? How can we measure what is going on?
Instrument the source code with the Likwid marker API.
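The instrumentation follows the usual LIKWID marker pattern: initialize once, bracket the timed kernel loop with a named region, and close at the end. A minimal sketch is shown below; the region name "dmvm" is a placeholder, and the #else branch defines the macros away so the uninstrumented build still compiles:

```c
/* With -DLIKWID_PERFMON the real marker macros from likwid.h are used;
 * without it they expand to nothing. */
#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_CLOSE
#endif

int run_instrumented(void)
{
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_THREADINIT;

    LIKWID_MARKER_START("dmvm");
    /* ... the timed dmvm iterations go here ... */
    LIKWID_MARKER_STOP("dmvm");

    LIKWID_MARKER_CLOSE;
    return 0;
}
```

Only the kernel loop should sit inside the region, so the counters are not polluted by initialization and I/O.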
Build the new version with:
icc -Ofast -xhost -std=c99 -D_GNU_SOURCE -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid
Test your new version with:
likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000
Repeat the scan of row count using the following command:
./bench-perf.pl 10000 MEM
What is the result? Repeat for L3 and L2 groups.
What do we learn from profiling?
Optimization and Validation
What can we do about the performance drops?
Plan and implement an optimization called spatial cache blocking. Make the target cache you block for configurable.
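One way to sketch spatial cache blocking for this kernel (names and layout are assumptions; adapt to dmvm.c): split the row loop into blocks small enough that the corresponding piece of y[] stays in the target cache while all columns stream over it. The block size is then derived from the chosen cache level, e.g. a few thousand doubles for an L2 target, leaving room for the a[] stream:

```c
#include <stddef.h>

/* Blocked dmvm: the r loop is tiled with block size rowBlock so that
 * 8*rowBlock bytes of y[] fit into the target cache. Making rowBlock a
 * parameter is what allows blocking for L1, L2, or L3 at run time. */
void dmvm_blocked(int rows, int cols, int rowBlock,
                  const double *restrict a,
                  const double *restrict x, double *restrict y)
{
    for (int rb = 0; rb < rows; rb += rowBlock) {
        int rEnd = (rb + rowBlock < rows) ? rb + rowBlock : rows;
        for (int c = 0; c < cols; ++c)
            for (int r = rb; r < rEnd; ++r)
                y[r] += a[(size_t)c * rows + r] * x[c];
    }
}
```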
Repeat the benchmarking only (without setting the -DLIKWID_PERFMON define) and validate the results with profiling.
Parallelize both the initial and the optimized version with OpenMP. Mind the reduction on y!
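If the outer loop runs over columns (as in the sketches above; verify against dmvm.c), every thread updates the whole y[] vector, so y needs an array reduction. A hedged sketch using OpenMP 4.5 array-section reductions (supported by the Intel 19 compiler used here):

```c
#include <stddef.h>

/* OpenMP version of the plain kernel. The reduction clause gives each
 * thread a private copy of y[0..rows-1] and sums them at the end;
 * without it, concurrent updates to y[r] would race. */
void dmvm_omp(int rows, int cols, const double *restrict a,
              const double *restrict x, double *restrict y)
{
#pragma omp parallel for reduction(+ : y[0:rows]) schedule(static)
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            y[r] += a[(size_t)c * rows + r] * x[c];
}
```

An alternative design is to parallelize over row blocks instead (each thread owns a disjoint slice of y), which avoids the reduction entirely; comparing the two is a worthwhile part of the scaling study.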
Benchmark the results and scale out within one socket. What are the results?