Hands-On: analysis of a molecular dynamics proxy app
A diagnostic performance analysis of the MiniMD proxy app
In this exercise you will quantify and compare the effectiveness of SIMD vectorization for a molecular dynamics benchmark.
You will investigate which algorithm ("half-neigh" or "full-neigh") is better suited for SIMD vectorization and quantify how effectively the compiler can employ SIMD for each. While one could simply try both and be guided by time to solution alone, hardware counter profiling provides data-backed insight into what is actually going on and which further optimization options exist.
You can find the benchmark code in the MINIMD folder of the teacher account:
$ cp -a ~ghager/MINIMD ~
Investigate the benchmark code
You may have a look at the instrumented force calculation variants. You can find them in ./src/force_lj.cpp: the methods ForceLJ::compute_halfneigh (lines 79-139) and ForceLJ::compute_fullneigh (lines 148-204). Which of them do you think is better suited for SIMD vectorization?
Examine the build settings in include_ICC.mk. Build by calling:
$ make
You can ignore the warnings :-).
After changing the build settings, run $ make clean first to make them take effect.
You need to generate three variants:
- without SIMD vectorization
- with SSE SIMD vectorization
- with AVX SIMD vectorization
It is recommended to:
- Edit include_ICC.mk and ensure only the OPT line with no-vec is uncommented.
$ make clean && make
$ mv miniMD-ICC miniMD-novec
Repeat with the SSE and then the AVX OPT line enabled, moving the binaries to `miniMD-SSE` and `miniMD-AVX`, respectively.
Caveat: For proper SIMD vectorization, make sure to add the "-DUSE_SIMD" option to the compiler command line. For the non-vectorized version, do not use it.
Change to the ./data folder. To get an overview of available options, do:
$ cd data
$ ../miniMD-<VERSION> -h
To run the benchmark, use:
$ ../miniMD-<VERSION> --half_neigh <0|1>
The number specifies whether the half-neigh variant is used (0 == off, 1 == on).
Hardware performance counter profiling
Use the FLOPS_DP performance group and note the event counts for every run. In addition, note the following derived metrics:
- Runtime (RDTSC)
- Vectorization ratio
Do this for the following runs:
$ likwid-perfctr -g FLOPS_DP -C S0:1 -m ../miniMD-novec --half_neigh 1
$ likwid-perfctr -g FLOPS_DP -C S0:1 -m ../miniMD-SSE --half_neigh 1
$ likwid-perfctr -g FLOPS_DP -C S0:1 -m ../miniMD-AVX --half_neigh 1
$ likwid-perfctr -g FLOPS_DP -C S0:1 -m ../miniMD-novec --half_neigh 0
$ likwid-perfctr -g FLOPS_DP -C S0:1 -m ../miniMD-SSE --half_neigh 0
$ likwid-perfctr -g FLOPS_DP -C S0:1 -m ../miniMD-AVX --half_neigh 0
Analysis of the profiling results
Look at the following metrics for each algorithm:
- Fraction of arithmetic floating-point instructions (the useful work) relative to overall instructions (the processor work).
- Vectorization ratio as reported by LIKWID; check the help text of the FLOPS_DP group to see how it is calculated.
- CPI as the central metric of execution efficiency for the given instruction mix.
To compare the different versions, set up the following ratios:
- Total instructions of <SSE|AVX> version compared to novec (for HN / FN each)
- Arithmetic instructions of <SSE|AVX> version compared to novec (for HN / FN each)
- Total instructions of every version of FN compared to the same version of HN
You can do that with pen and paper, but we prepared an Excel sheet to speed things up. You can download it from the Moodle page.
- What happens if you turn on autovectorization for HN/FN?
- Can you interpret the runtimes? What role does the CPI value play?
- How do both versions compare with regard to actual useful work, and how does this change when applying SIMD?