The goal of the first assignment is to practice how to time and profile code, and to inspect some aspects that contribute to performance. This is an assortment of various, independent tasks.
In this task you connect C code with the assembler code generated from it.
Use the code from the previous assignment on naive benchmarking. Generate assembler from it for each of the optimization levels that you used before. Skim it to recognize the control flow (or absence of it) from the C code. Reconsider your previous speculations in light of the new information.
Note that you are not expected to be able to read the assembler code properly; in fact, most of it will be unintelligible. Try to recognize constants from your C code and to spot patterns from the little assembler that you have seen in the course.
Most relevant videos: Basic Assembler.
In this task you familiarize yourself with some basic use of two development tools, valgrind and GDB.
Most relevant videos: Debugging.
Implement a program with the following code:
int * as;
as = (int*)malloc(10*sizeof(int));
int sum = 0;
for ( int ix = 0; ix < 10; ++ix )
as[ix] = ix;
for ( int ix = 0; ix < 10; ++ix )
sum += as[ix];
printf("%d\n", sum);
free(as);
Compile the program with optimization level 0 and with the flag -g, then run the program with valgrind memcheck.
Repeat the compilation and running with valgrind after making one of the following modifications:
Discuss the respective valgrind output.
Please remember to remove the files vgcore.* generated by valgrind, so that they do not consume unnecessary space on the hard disk.
Implement a program based on the following code
int * as = NULL;
int sum = 0;
for ( int ix = 0; ix < 10; ++ix )
as[ix] = ix;
for ( int ix = 0; ix < 10; ++ix )
sum += as[ix];
free(as);
Compile the program with optimization level 0 and with the flag -g, then run the program with gdb. Follow the next steps:
In this task you will see the effect of automatic inlining. You have to compile programs with optimization level 2 (i.e. compiler flag -O2) in order to produce the effect.
Implement a function with signature
void
mul_cpx(
double *a_re,
double *a_im,
double *b_re,
double *b_im,
double *c_re,
double *c_im
);
that computes the product of the two complex numbers b and c and stores it in a. The suffixes re and im stand for the real and imaginary parts of a, b, and c, respectively. This function must multiply only a single pair of complex numbers, not several pairs at once.
You will implement this function several times, and it is important to use the same implementation strategy (but not the same code) each time.
Write a program, called same_file, that
- implements mul_cpx in the same file as the main function,
- allocates 6 arrays as_re, as_im, bs_re, bs_im, cs_re, and cs_im of doubles, each of length 30,000,
- initializes the entries of bs_re, bs_im, cs_re, and cs_im (any kind of entries will do), and then
- computes the products by calling mul_cpx, saving the results to as_re and as_im (thus calling mul_cpx 30,000 times).

Write a program, called different_file, that
- implements mul_cpx in a file different from the file containing the main function, called for instance different_file_mul.c. You may not include this file; provide it as an extra argument to the compiler instead, and only declare mul_cpx in the same file as the main function,
- allocates 6 arrays as_re, as_im, bs_re, bs_im, cs_re, and cs_im of doubles, each of length 30,000,
- initializes the entries of bs_re, bs_im, cs_re, and cs_im (any kind of entries will do), and then
- computes the products by calling mul_cpx, saving the results to as_re and as_im.

Write a program, called inlined_manually, that
- allocates 6 arrays as_re, as_im, bs_re, bs_im, cs_re, and cs_im of doubles, each of length 30,000,
- initializes the entries of bs_re, bs_im, cs_re, and cs_im (any kind of entries will do), and then
- computes the products with the body of mul_cpx inlined into the main function by hand.

Benchmark each of the three variants and reason about the differences. Be careful to print an element of as_re or as_im only after all benchmarking iterations have been performed, in order to avoid cluttering the terminal. Note that in the example implementation 200,000 benchmark iterations barely suffice to see the difference.
Most relevant videos: Inlining.
Write a C program, called locality, that
Use the following naive implementations of row and column sums. You may inline them at will; they are given as functions here to make all types transparent.
void
row_sums(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t ix = 0; ix < nrs; ++ix ) {
double sum = 0.;
for ( size_t jx = 0; jx < ncs; ++jx )
sum += matrix[ix][jx];
sums[ix] = sum;
}
}
void
col_sums(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t jx = 0; jx < ncs; ++jx ) {
double sum = 0.;
for ( size_t ix = 0; ix < nrs; ++ix )
sum += matrix[ix][jx];
sums[jx] = sum;
}
}
Note that when calling a function that has a parameter of type const double ** with an argument of type double **, you can explicitly convert the argument to the expected type.
Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 targeting the native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums, to prevent the compiler from optimizing the computation away. In the example implementation, 5000 benchmark iterations were more than sufficient to see the relevant effect. Interpret the results with an eye to memory access patterns.
Add a reimplementation of the slower summation in a more efficient way, and benchmark as before. The benchmarking results arising from this reimplementation might be surprising at first sight; they will be clarified in the next task.
Most relevant videos: Locality, Locality and Block Access.
Write a C program, called data_dependency, that
Use the following implementations of row sums. You may inline them at will; they are given as functions here to make all types transparent.
void
row_sums(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t ix = 0; ix < nrs; ++ix ) {
double sum = 0.;
for ( size_t jx = 0; jx < ncs; ++jx )
sum += matrix[ix][jx];
sums[ix] = sum;
}
}
void
row_sums_unrolled2(
double * sums,
const double ** matrix,
size_t nrs,
size_t ncs
)
{
for ( size_t ix = 0; ix < nrs; ++ix ) {
double sum0 = 0.;
double sum1 = 0.;
for ( size_t jx = 0; jx < ncs; jx += 2 ) {
sum0 += matrix[ix][jx];
sum1 += matrix[ix][jx+1];
}
sums[ix] = sum0 + sum1;
}
}
Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 targeting the native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of sums, to prevent the compiler from optimizing the computation away. In the example implementation, 5000 benchmark iterations were more than sufficient to see the relevant effect. Interpret the results with an eye to data dependency.
Add reimplementations row_sums_unrolled4 and row_sums_unrolled8, imitating the provided pattern, and benchmark as before.
Finally, add an implementation row_sums_unrolled4 in which you use an array of four doubles
double sum[4] = {0.,0.,0.,0.};
instead of four separate variables to store the intermediate sums.
Most relevant videos: CPU Pipeline.
In this task you get familiar with the impact of indirect addressing. It is frequently used for sparse matrix and vector operations.
Write a program that adds to a vector y a vector x multiplied by a scalar a. Both vectors have length 1,000,000 and are allocated on the heap. The vectors must be addressed indirectly, through an index vector p, and you will use two different initializations for this index vector.
Concretely, implement a program based on the following code:
size_t size = 1000000;
for ( size_t kx = 0; kx < size; ++kx ) {
size_t jx = p[kx];
y[jx] += a * x[jx];
}
The two initializations of the index vector p are as follows. First, we use a linear initialization:
for ( size_t ix = 0; ix < size; ++ix )
p[ix] = ix;
Second, we use an initialization that jumps in steps of 1000.
size_t size_jump = 1000;
for ( size_t jx = 0, kx = 0; jx < size_jump; ++jx)
for ( size_t ix = jx; ix < size; ix += size_jump, ++kx)
p[ix] = kx;
Compile your program with optimization level 0, with optimization level 2, and with optimization level 2 targeting the native architecture. Benchmark all variants and compare timings. Do not forget to print a random element of y, to prevent the compiler from optimizing the computation away. In the example implementation, 1000 benchmark iterations suffice to see the relevant effect.
Add an alternative implementation that accesses x and y directly in linear order, avoiding indirect addressing, and benchmark as before.
Most relevant videos: Indirect Addressing.
In this task you familiarize yourself with the performance difference between an HDD (Hard Disk Drive) and an SSD (Solid State Drive).
Home directories on gantenbein are mapped to a RAID 5 configuration of HDDs. In addition, an SSD is available to all users at /run/mount/scratch. When using /run/mount/scratch, create a folder hpcuserXXX, with XXX your hpc user number on gantenbein.
Most relevant videos: File System Operations, Files in C and Binary Data.
Implement a program that
Compile your program with optimization level 2. Benchmark each step and compare timings. It is not necessary to explicitly inhibit compiler simplifications in this task, but after each call to fwrite, you need to call fflush. Do not open and close the file inside the benchmark loop. Use no more than 10 iterations when benchmarking.
Next, in the terminal,
- copy the directory /usr/include to your home directory, calling the copy include_copy,
- remove include_copy,
- copy /usr/include to your directory in /run/mount/scratch, again calling the copy include_copy,
- remove include_copy,
and time each of these operations.
Discuss all timing results and speculate about the efficient use of HDD and SSD in general.
Like Assignment 0, this assignment is a preparation for your work on Assignments 2 through 5. In particular, you will need the following:
From the part on assembler, it is useful to remember that the C code can significantly differ from the generated machine code. This can impact runtime performance, benchmarking, and profiling.
From the parts on valgrind and GDB, it is useful to remember valgrind and GDB as efficient means to detect incorrect memory access. When you observe a segmentation fault in your program, one of the first steps to locate it should involve the use of valgrind and GDB.
From the part on inlining, it is useful to remember that inlining of code potentially speeds up a function. This is particularly true if the function is small. To explicitly suggest to the compiler that a function be inlined you can prepend the declaration with "static inline".
From the part on locality, it is useful to remember that locality can make a dramatic difference in runtime. To see this effect, it is crucial to run the code sufficiently often (at least 100,000 times) when benchmarking.
From the part on data dependency, it is useful to remember that dependency of code on previous results, in particular inside of for loops, can inhibit performance.
From the part on indirect addressing, it is useful to remember that indirect addressing always comes with extra cost, and that this extra cost rises enormously when memory is accessed in a nonlinear way.
From the part on HDD and SSD, it is useful to remember, also for future use beyond this course, that the advantage of an SSD may be offset by a RAID of HDDs.