The goal of the first assignment is to practice how to time and profile code, and to inspect some aspects that contribute to performance. This is an assortment of various, independent tasks.

Assembler

In this task you connect C code with the assembler code generated from it.

Use the code from the previous assignment on naive benchmarking. Generate assembler from it for each of the optimization levels that you used before. Skim it to recognize the control flow (or absence of it) from the C code. Reconsider your previous speculations in light of the new information.

Note that it is not expected that you can properly read the assembler code; In fact, most of it will be unintelligible. Try to recognize constants from your C code and observe patterns from the little assembler that you have seen in the course.

Most relevant videos: Basic Assembler.

Valgrind and GDB

In this task you familiarize yourself with some basic use of two development tools, valgrind and GDB.

Most relevant videos: Debugging.

Valgrind

Implement a program with the following code:

int * as;
as = (int*)malloc(10*sizeof(int));
int sum = 0;

for ( int ix = 0; ix < 10; ++ix )
  as[ix] = ix;

for ( int ix = 0; ix < 10; ++ix )
  sum += as[ix];

printf("%d\n", sum);

free(as);

Compile the program with optimization level 0 and with the flag -g, then run the program with valgrind memcheck.

Repeat the compilation and running with valgrind after making one of the following modifications:

Discuss the respective valgrind output.

Please remember to remove the files vgcore.* generated by valgrind in order to not consume extraneous space on hard disc.

GDB

Implement a program based on the following code

int * as = NULL;
int sum = 0;

for ( int ix = 0; ix < 10; ++ix )
  as[ix] = ix;

for ( int ix = 0; ix < 10; ++ix )
  sum += as[ix];

free(as);

Compile the program with optimization level 0 and with the flag -g, then run the program with gdb. Follow the next steps:

Inlining

In this task you will see the effect of automatic inlining. You have to compile programs with optimization level 2 (i.e. compiler flag -O2) in order to produce the effect.

Implement a function with signature

void
mul_cpx(
    double *a_re,
    double *a_im,
    double *b_re,
    double *b_im,
    double *c_re,
    double *c_im,
    );

that computes the product of two complex numbers b and c, and stores it in a. The suffices re and im stand for the real and imaginary part of a,b,c respectively. This function may only multiply two complex numbers as opposed to several pairs.

You will implement this function several times, and it is important to use the same implementation strategy (but not the same code) each time.

Benchmark each of the three variants and reason about the difference. Be careful to print the element of as only after all benchmarking iterations have been performed in order to avoid cluttering the terminal. Note that 200.000 benchmark iterations barely suffice in the example implementation to see the difference.

Most relevant videos: Inlining.

Locality

Write a C program, called locality, that

Use the following naive implementations of row and column sums. You may inline them at will, they are here given as functions to make all types transparent.

void
row_sums(
  double * sums,
  const double ** matrix,
  size_t nrs,
  size_t ncs
)
{
  for ( size_t ix = 0; ix < nrs; ++ix ) {
    double sum = 0.;
    for ( size_t jx = 0; jx < ncs; ++jx )
      sum += matrix[ix][jx];
    sums[ix] = sum;
  }
}

void
col_sums(
  double * sums,
  const double ** matrix,
  size_t nrs,
  size_t ncs
  )
{
  for ( size_t jx = 0; jx < ncs; ++jx ) {
    double sum = 0.;
    for ( size_t ix = 0; ix < nrs; ++ix )
      sum += matrix[ix][jx];
    sums[jx] = sum;
  }
}

Note that when calling a function that has an argument of type const double ** with a value of type double **, you can explicitly convert the argument to the correct type.

Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 including native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums to avoid over-simplification by the compiler. In the example implementation 5000 benchmark iterations were more than sufficient to see the relevant effect. Interpret the results with an eye to memory access patterns.

Add a reimplementation of the slower summation in a more effective way, and benchmark as before. The benchmarking results arising from this reimplement might be surprising at first sight and will be clarified in the next task.

Most relevant videos: Locality, Locality and Block Access.

Data dependency

Write a C program, called data_dependency, that

Use the following implementations of row sums. You may inline them at will, they are here given as functions to make all types transparent.

void
row_sums(
  double * sums,
  const double ** matrix,
  size_t nrs,
  size_t ncs
)
{
  for ( size_t ix = 0; ix < nrs; ++ix ) {
    double sum = 0.;
    for ( size_t jx = 0; jx < ncs; ++jx )
      sum += matrix[ix][jx];
    sums[ix] = sum;
  }
}

void
row_sums_unrolled2(
  double * sums,
  const double ** matrix,
  size_t nrs,
  size_t ncs
)
{
  for ( size_t ix = 0; ix < nrs; ++ix ) {
    double sum0 = 0.;
    double sum1 = 0.;
    for ( size_t jx = 0; jx < ncs; jx += 2 ) {
      sum0 += matrix[ix][jx];
      sum1 += matrix[ix][jx+1];
    }
    sums[ix] = sum0 + sum1;
  }
}

Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 including native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums to avoid over-simplification by the compiler. In the example implementation 5000 benchmark iterations were more than sufficient to see the relevant effect. Interpret the results with an eye to data dependency.

Add reimplementations row_sums_unrolled4 and row_sums_unrolled8, imitating the provided pattern, and benchmark as before.

Finally, add an implementation row_sums_unrolled4 in which you use an array of four doubles

double sum[4] = {0.,0.,0.,0.};

instead of four separate variable to store the intermediate sums.

Most relevant videos: CPU Pipeline.

Indirect addressing

In this task you get familiar with the impact of indirect addressing. It is frequently used for sparse matrix and vector operators.

Write a program that add to a vector y the multiple of a vector x by a. Both vectors have length 1,000,000 and are allocated on the heap. The vectors must be addressed indirectly, through an index vector p, and you will use two different initializations for this index vector.

Concretely, implement a program base on the following code:

size_t size = 1000000;
for ( size_t kx = 0; kx < size; ++kx ) {
  size_t jx = p[kx];
  y[jx] += a * x[jx];
}

The two initializations of the index vector p are as follows. First, we use a linear initialization:

for ( size_t ix = 0; ix < size; ++ix  )
  p[ix] = ix;

Second, we use an initialization that jumps in steps of 1000.

size_t size_jump = 1000;
for ( size_t jx = 0, kx = 0; jx < size_jump; ++jx)
  for ( size_t ix = jx; ix < size; ix += size_jump, ++kx)
    p[ix] = kx;

Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 including native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums to avoid over-simplification by the compiler. In the example implementation 1000 benchmark iterations suffices to see the relevant effect.

Add an implementation alternative implementation that accesses x and y in linear order directly, avoiding indirect addressing, and benchmark as before.

Most relevant videos: Indirect Addressing.

Writing to and reading from HDD and SSD

In this task you familiarize yourself with the performance difference between HDD (Hard Disc Drive) and SSD (Solid State Drive).

Home directories on gantenbein are mapped to a RAID 5 configuration of HDDs. In addition there is an SSD available at /run/mount/scratch, which is available to all users. When accessing /run/mount/scratch, create a folder hpcuserXXX with XXX your hpc user number on gantenbein.

Most relevant videos: File System Operations, Files in C and Binary Data.

Writing and reading in C

Implement a program that

Compile your program with optimization level 2. Benchmark each step and compare timings. It is not necessary to explicitly inhibit compiler simplifications in this task, but after each call to fwrite, you need to call fflush. Do not open and close the file inside the benchmark loop. Use no more than 10 iterations when benchmarking.

Writing and reading in bash

Next, in the terminal,

Discuss all timing results and speculate about the general efficient use of HDD and SSD.

Learning Summary

Like Assignment 0, this assignment is a preparation towards your working on Assignments 2 through 5. In particular, you will need the following: