The goal of the first assignment is to practice how to time and profile code, and to inspect some aspects that contribute to performance. This is an assortment of various, independent tasks.
In this task you connect C code with the assembler code generated from it.
Use the code from the previous assignment on naive benchmarking. Generate assembler from it for each of the optimization levels that you used before. Skim it to recognize the control flow (or absence of it) from the C code. Reconsider your previous speculations in light of the new information.
Note that you are not expected to be able to read the assembler code properly; in fact, most of it will be unintelligible. Try to recognize constants from your C code and to spot patterns from the little assembler that you have seen in the course.
Most relevant videos: Basic Assembler.
In this task you familiarize yourself with some basic use of two development tools, valgrind and GDB.
Most relevant videos: Debugging.
Implement a program with the following code:
int * as;
as = (int*)malloc(10*sizeof(int));
int sum = 0;
for ( int ix = 0; ix < 10; ++ix )
as[ix] = ix;
for ( int ix = 0; ix < 10; ++ix )
sum += as[ix];
printf("%d\n", sum);
free(as);
Compile the program with optimization level 0 and with the flag -g, then run the program with valgrind memcheck.
Repeat the compilation and running with valgrind after making one of the following modifications:
Discuss the respective valgrind output.
Please remember to remove the files vgcore.* generated by valgrind, so that they do not consume unnecessary space on the hard disk.
Implement a program based on the following code
int * as = NULL;
int sum = 0;
for ( int ix = 0; ix < 10; ++ix )
as[ix] = ix;
for ( int ix = 0; ix < 10; ++ix )
sum += as[ix];
free(as);
Compile the program with optimization level 0 and with the flag -g, then run the program with gdb. Follow the next steps:
In this task you will see the effect of automatic inlining. You have to compile programs with optimization level 2 (i.e. compiler flag -O2) in order to produce the effect.
Implement a function with signature
void
mul_cpx(
double *a_re,
double *a_im,
double *b_re,
double *b_im,
double *c_re,
double *c_im
);
that computes the product of the two complex numbers b and c and stores it in a. The suffixes re and im stand for the real and imaginary parts of a, b, and c, respectively. This function must multiply only a single pair of complex numbers, not several pairs at once.
You will implement this function several times, and it is important to use the same implementation strategy (but not the same code) each time.
Write a program, called same_file, that
- implements mul_cpx in the same file as the main function,
- allocates 6 arrays as_re, as_im, bs_re, bs_im, cs_re, and cs_im of doubles, each of length 30,000,
- initializes the entries of bs_re, bs_im, cs_re, and cs_im (any kind of entries will do), and then
- computes the products by calling mul_cpx, saving the results to as_re and as_im (thus calling mul_cpx 30,000 times).

Write a program, called different_file, that
- implements mul_cpx in a file different from the file containing the main function, called for instance different_file_mul.c. You may not include this file; provide it as an extra argument to the compiler instead, and only declare mul_cpx in the same file as the main function,
- allocates 6 arrays as_re, as_im, bs_re, bs_im, cs_re, and cs_im of doubles, each of length 30,000,
- initializes the entries of bs_re, bs_im, cs_re, and cs_im (any kind of entries will do), and then
- computes the products by calling mul_cpx, saving the results to as_re and as_im.

Write a program, called inlined_manually, that
- allocates 6 arrays as_re, as_im, bs_re, bs_im, cs_re, and cs_im of doubles, each of length 30,000,
- initializes the entries of bs_re, bs_im, cs_re, and cs_im (any kind of entries will do), and then
- computes the products with the body of mul_cpx inlined into the main function by hand.

Benchmark each of the three variants and reason about the differences. Be careful to print an element of as_re or as_im only after all benchmarking iterations have been performed, in order to avoid cluttering the terminal. Note that in the example implementation 200,000 benchmark iterations barely suffice to see the difference.
Most relevant videos: Inlining.
Write a C program, called locality, that
Use the following naive implementations of row and column sums. You may inline them at will; they are given as functions here to make all types transparent.
void
row_sums(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t ix = 0; ix < nrs; ++ix ) {
double sum = 0.;
for ( size_t jx = 0; jx < ncs; ++jx )
sum += matrix[ix][jx];
sums[ix] = sum;
}
}
void
col_sums(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t jx = 0; jx < ncs; ++jx ) {
double sum = 0.;
for ( size_t ix = 0; ix < nrs; ++ix )
sum += matrix[ix][jx];
sums[jx] = sum;
}
}
Note that when calling a function that has a parameter of type const double ** with an argument of type double **, you can explicitly convert the argument to the expected type.
Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 targeting the native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums, to prevent the compiler from optimizing the computation away. In the example implementation, 5000 benchmark iterations were more than sufficient to see the relevant effect. Interpret the results with an eye to memory access patterns.
Add a reimplementation of the slower summation in a more efficient way, and benchmark as before. The benchmarking results arising from this reimplementation might be surprising at first sight; they will be clarified in the next task.
Most relevant videos: Locality, Locality and Block Access.
Write a C program, called data_dependency, that
Use the following implementations of row sums. You may inline them at will; they are given as functions here to make all types transparent.
void
row_sums(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t ix = 0; ix < nrs; ++ix ) {
double sum = 0.;
for ( size_t jx = 0; jx < ncs; ++jx )
sum += matrix[ix][jx];
sums[ix] = sum;
}
}
void
row_sums_unrolled2(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t ix = 0; ix < nrs; ++ix ) {
double sum0 = 0.;
double sum1 = 0.;
for ( size_t jx = 0; jx < ncs; jx += 2 ) {
sum0 += matrix[ix][jx];
sum1 += matrix[ix][jx+1];
}
sums[ix] = sum0 + sum1;
}
}
Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 targeting the native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums, to prevent the compiler from optimizing the computation away. In the example implementation, 5000 benchmark iterations were more than sufficient to see the relevant effect. Interpret the results with an eye to data dependency.
Add reimplementations row_sums_unrolled4 and row_sums_unrolled8, imitating the provided pattern, and benchmark as before.
Finally, add an implementation row_sums_unrolled4 in which you use an array of four doubles
double sum[4] = {0.,0.,0.,0.};
instead of four separate variables to store the intermediate sums.
Most relevant videos: CPU Pipeline.
In this task you get familiar with the impact of indirect addressing. It is frequently used for sparse matrix and vector operations.
Write a program that adds to a vector y a vector x multiplied by a scalar a. Both vectors have length 1,000,000 and are allocated on the heap. The vectors must be addressed indirectly, through an index vector p, and you will use two different initializations for this index vector.
Concretely, implement a program based on the following code:
size_t size = 1000000;
for ( size_t kx = 0; kx < size; ++kx ) {
size_t jx = p[kx];
y[jx] += a * x[jx];
}
The two initializations of the index vector p are as follows. First, we use a linear initialization:
for ( size_t ix = 0; ix < size; ++ix )
p[ix] = ix;
Second, we use an initialization that jumps in steps of 1000.
size_t size_jump = 1000;
for ( size_t jx = 0, kx = 0; jx < size_jump; ++jx)
for ( size_t ix = jx; ix < size; ix += size_jump, ++kx)
p[ix] = kx;
Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 targeting the native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of y, to prevent the compiler from optimizing the computation away. In the example implementation, 1000 benchmark iterations suffice to see the relevant effect.
Add an alternative implementation that accesses x and y directly in linear order, avoiding indirect addressing, and benchmark as before.
Most relevant videos: Indirect Addressing.
In this task you familiarize yourself with the performance difference between an HDD (Hard Disk Drive) and an SSD (Solid State Drive).
Home directories on gantenbein are mapped to a RAID 5 configuration of HDDs. In addition, an SSD is available to all users at /run/mount/scratch. When using /run/mount/scratch, create a folder hpcuserXXX, with XXX your hpc user number on gantenbein.
Most relevant videos: File System Operations, Files in C and Binary Data.
Implement a program that
Compile your program with optimization level 2. Benchmark each step and compare timings. It is not necessary to explicitly inhibit compiler simplifications in this task, but after each call to fwrite, you need to call fflush. Do not open and close the file inside the benchmark loop. Use no more than 10 iterations when benchmarking.
Next, in the terminal,
- copy the directory /usr/include to your home directory, calling the copy include_copy,
- remove include_copy,
- copy /usr/include to your directory in /run/mount/scratch, again calling the copy include_copy,
- remove include_copy,
and time each of these operations.
Discuss all timing results and speculate about the efficient use of HDD and SSD in general.
Like Assignment 0, this assignment is a preparation for your work on Assignments 2 through 5. In particular, you will need the following:
From the part on assembler, it is useful to remember that the C code can significantly differ from the generated machine code. This can impact runtime performance, benchmarking, and profiling.
From the parts on valgrind and GDB, it is useful to remember valgrind and GDB as efficient means to detect incorrect memory access. When you observe a segmentation fault in your program, one of the first steps to locate it should involve the use of valgrind and GDB.
From the part on inlining, it is useful to remember that inlining of code potentially speeds up a function. This is particularly true if the function is small. To explicitly suggest to the compiler that a function be inlined you can prepend the declaration with "static inline".
From the part on locality, it is useful to remember that locality can make a dramatic difference in runtime. To see this effect, it is crucial to run the code sufficiently often (at least 100,000 times) when benchmarking.
From the part on data dependency, it is useful to remember that dependency of code on previous results, in particular inside of for loops, can inhibit performance.
From the part on indirect addressing, it is useful to remember that indirect addressing always comes with extra cost, and that this extra cost rises enormously when memory is accessed in a nonlinear way.
From the part on HDD and SSD, it is useful to remember, also for future use beyond this course, that the advantage of an SSD may be offset by a RAID of HDDs.