## Exercise: A loop with a divide

We want to numerically integrate the function

$$f(x) = \frac{4}{1+x^2}$$

from 0 to 1. The result should be an approximation to π, of course. We use a very simple rectangular integration scheme that works by summing up the areas of rectangles centered around $x_i$ with a width of $\Delta x$ and a height of $f(x_i)$:

    int SLICES = 200000000;
    double x, sum = 0.0;               // running sum of f(x_i)
    double delta_x = 1./SLICES;        // width of each rectangle

    for (int i=0; i < SLICES; i++) {
        x = (i+0.5)*delta_x;           // midpoint of rectangle i
        sum += (4.0 / (1.0 + x * x));  // one divide per iteration
    }

    double Pi = sum * delta_x;         // approximation to π
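Written out as a formula, the scheme is the midpoint rule. The symbol $N$ is introduced here only for illustration and stands for the number of slices (`SLICES` in the code). The exact value follows from the antiderivative $4\arctan x$:

$$
\int_0^1 \frac{4}{1+x^2}\,dx = 4\arctan(1) - 4\arctan(0) = \pi,
\qquad
\pi \approx \Delta x \sum_{i=0}^{N-1} f(x_i),
\quad x_i = \left(i + \tfrac{1}{2}\right)\Delta x,
\quad \Delta x = \frac{1}{N}
$$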

You can find example code (in FORTRAN and C) in the folder DIV. The code reports the runtime of the loop in seconds and the computed value of π.
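For readers without access to the folder, the following is a minimal sketch of how such a timed loop could look in C. It is not the code from DIV: the names `pi_result` and `timed_pi` are hypothetical, and the use of the standard `clock()` function (which measures CPU time rather than wall-clock time) is an assumption that is only reasonable for the serial version.

```c
#include <time.h>

/* Result of one timed run: the approximation to pi and the loop runtime. */
typedef struct {
    double pi;       /* computed approximation to pi */
    double seconds;  /* CPU time spent in the loop, in seconds */
} pi_result;

/* Midpoint-rule integration of 4/(1+x^2) on [0,1] using `slices` rectangles.
 * Hypothetical helper for illustration; the timer choice is an assumption. */
pi_result timed_pi(int slices)
{
    double delta_x = 1.0 / slices;  /* width of each rectangle */
    double sum = 0.0;

    clock_t start = clock();
    for (int i = 0; i < slices; i++) {
        double x = (i + 0.5) * delta_x;  /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);      /* one divide per iteration */
    }
    clock_t stop = clock();

    pi_result r;
    r.pi = sum * delta_x;
    r.seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return r;
}
```

A caller would invoke `timed_pi(200000000)` and print both fields of the result, for example with `printf` from `<stdio.h>`. Once threads are involved, a wall-clock timer is the better choice, since `clock()` typically adds up the CPU time of all threads.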

- Run the code. Assuming that the (non-pipelined) divide operation dominates the runtime of the code, can you estimate the latency for a FP divide in CPU cycles? Hint: You should set a definite clock speed for this experiment (e.g., 2.2 GHz).
- The SIMD vectorization is turned off in the Makefile (option -no-vec). What happens if you omit this option, allowing AVX vectorization? What can you conclude about the number of available divide units on the core?
- Parallelize the code with OpenMP and make sure that it still produces an approximation to π. How does the performance (i.e., inverse runtime) of the program scale as you increase the number of threads? Is this behavior to be expected?
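As a starting point for the first and third task, here is a hedged sketch in C. The function names `cycles_per_iteration` and `pi_openmp` are hypothetical and not taken from the DIV code. The first function only converts a measured runtime into average cycles per loop iteration. Under the exercise's assumption that the divide dominates, this number is an estimate of the cost of one divide. The second function shows one possible way to parallelize the loop, using an OpenMP `reduction` clause so that the threads do not race on `sum`.

```c
/* Average number of CPU cycles per loop iteration, given the measured
 * runtime in seconds, the fixed clock frequency in Hz, and the number
 * of iterations. Hypothetical helper for illustration. */
double cycles_per_iteration(double runtime_s, double clock_hz, long iterations)
{
    return runtime_s * clock_hz / (double)iterations;
}

/* One possible OpenMP variant of the integration loop.
 * Compile with OpenMP enabled (e.g. -fopenmp for GCC); without it the
 * pragma is ignored and the loop simply runs serially. */
double pi_openmp(int slices)
{
    double delta_x = 1.0 / slices;
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the partial sums
     * are combined into `sum` at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < slices; i++) {
        double x = (i + 0.5) * delta_x;  /* declared in the loop body, so private */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * delta_x;
}
```

For instance, a made-up runtime of 2.0 s at 2.2 GHz with 200000000 slices would correspond to 22 cycles per iteration; this figure is purely illustrative and not a measurement. Declaring `x` inside the loop body is a deliberate choice: a shared `x` declared outside the loop would be a data race unless it were marked `private`.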

Last modified: Wednesday, 8 April 2015, 9:59 PM