Assignment 0: Warming up

Note: The processors of the Emmy cluster run in "Turbo Mode". This means that, depending on the number of cores in use and on environmental conditions, a CPU chip can run at clock speeds above its "nominal" speed of 2.2 GHz (up to 3.0 GHz). This can make benchmarking difficult, so it is sometimes useful to fix the frequency. You can do this by adding a property to the resource specification on the qsub command line:

$ qsub -l nodes=1:ppn=40:f2.2,walltime=01:00:00 ...

This will set the frequency to a fixed 2.2 GHz for all cores on the node(s) of the job. The allowed settings are:

turbo, noturbo, f2.2, f2.1, f2.0, f1.9, f1.8, f1.7, f1.6, f1.5, f1.4, f1.3, f1.2

So you can also run at a frequency which is lower than the nominal clock speed. "turbo" is the default.


Write a benchmark code that numerically integrates the function

f(x) = 4/(1+x^2)

from 0 to 1. The result should be an approximation to π, of course. You may use a very simple rectangular integration scheme that works by summing up the areas of rectangles centered around x_i with a width of Δx and a height of f(x_i):

int SLICES = 1000000000;
double delta_x = ....;
for (int i=0; i < SLICES; i++) {
  x = (i+0.5)*delta_x;
  sum += (4.0 / (1.0 + x * x));
}
Pi = sum * delta_x;

Complete the code fragment (translate to FORTRAN if you wish), add suitable timing functions (as described in the Getting Started Guide), make sure that it actually computes an approximation to π, and report runtime and performance in MFlop/s as obtained on one core of the Emmy cluster. Use the Intel compiler with the recommended compiler options. Include the relevant parts of your code in your submission.
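As a starting point, the timing could be wrapped around the loop roughly as in the following sketch. It is only one possible arrangement and rests on several assumptions: the timer uses the C11 function timespec_get rather than whatever the Getting Started Guide recommends, the helper names wallclock_seconds, integrate_pi and time_integration are made up for this illustration, and the step width is taken to be 1/SLICES.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get; an assumption, the
   Getting Started Guide may recommend a different timer). */
double wallclock_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Rectangle rule for 4/(1+x^2) on [0,1] with rectangles centered at x_i.
   The step width 1/slices is an assumption of this sketch. */
double integrate_pi(long slices) {
    assert(slices > 0);
    const double delta_x = 1.0 / (double)slices;
    double sum = 0.0;
    for (long i = 0; i < slices; i++) {
        double x = ((double)i + 0.5) * delta_x;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * delta_x;
}

/* Runs the integration once, stores the result in *pi_out and returns
   the elapsed wall-clock time in seconds. */
double time_integration(long slices, double *pi_out) {
    double start = wallclock_seconds();
    *pi_out = integrate_pi(slices);
    return wallclock_seconds() - start;
}
```

Calling time_integration(1000000000L, &pi) from main and printing both pi and the returned time also ensures that the result is actually used, so the compiler cannot discard the loop as dead code.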
  1. (40 credits) How many CPU cycles does the code take to execute per iteration of the loop? Describe how you arrived at this number from your measurements!
  2. (20 credits) Discuss different options for appropriate performance metrics in this code. What would you consider a good performance metric here?
  3. (40 credits) What happens if you change the code to use single-precision floating-point numbers? Are there any relevant code changes beyond data types? Why do you not get a reasonable accuracy for π? What about the performance?
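For the single-precision question, a side-by-side comparison along the following lines may help to see the effect on accuracy. This is a sketch under assumptions, not a reference solution: the function names pi_float, pi_double and abs_error are invented here, the step width is taken to be 1/slices, and the reference value of π is hard-coded. Note that the float variant also uses float literals (0.5f, 4.0f, 1.0f); with double literals the arithmetic would be promoted to double precision.

```c
#include <assert.h>

/* Reference value of pi, hard-coded for this sketch. */
static const double PI_REF = 3.14159265358979323846;

/* Absolute deviation of an approximation from PI_REF. */
double abs_error(double approx) {
    double d = approx - PI_REF;
    return (d < 0.0) ? -d : d;
}

/* Rectangle rule with all variables and literals in single precision. */
float pi_float(int slices) {
    assert(slices > 0);
    const float delta_x = 1.0f / (float)slices;
    float sum = 0.0f;
    for (int i = 0; i < slices; i++) {
        float x = ((float)i + 0.5f) * delta_x;
        sum += 4.0f / (1.0f + x * x);
    }
    return sum * delta_x;
}

/* The same scheme in double precision, for comparison. */
double pi_double(int slices) {
    assert(slices > 0);
    const double delta_x = 1.0 / (double)slices;
    double sum = 0.0;
    for (int i = 0; i < slices; i++) {
        double x = ((double)i + 0.5) * delta_x;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * delta_x;
}
```

Printing abs_error(pi_float(n)) and abs_error(pi_double(n)) for increasing n shows how the two variants diverge as the running sum grows much larger than the individual terms being added.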
Note: This code can be used to measure the duration of a floating-point divide because the divide operation dominates its runtime. All other operations in the loop can be "hidden" behind the divide.
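The conversion from a measured runtime to cycles per iteration and to MFlop/s is plain arithmetic, sketched below. The function names and parameters are hypothetical. In particular, the number of floating-point operations counted per iteration is left as a parameter, since which operations to count is a matter of convention.

```c
#include <assert.h>

/* Cycles per loop iteration from the measured wall-clock runtime in
   seconds, the core clock frequency in Hz and the number of iterations.
   The result is only meaningful if the clock frequency is known. */
double cycles_per_iteration(double runtime_s, double clock_hz,
                            double iterations) {
    assert(runtime_s > 0.0 && clock_hz > 0.0 && iterations > 0.0);
    return runtime_s * clock_hz / iterations;
}

/* Performance in MFlop/s. flops_per_iteration is whatever operation
   count one decides to attribute to a single loop iteration. */
double mflops(double runtime_s, double iterations,
              double flops_per_iteration) {
    assert(runtime_s > 0.0 && iterations > 0.0);
    return iterations * flops_per_iteration / runtime_s / 1.0e6;
}
```

Running the job with a fixed-frequency property such as f2.2 makes the clock_hz argument well defined; with "turbo" the actual clock speed during the measurement is not known in advance.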