I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array using the following code (I am aware that I should just scale the result by 'a' at the end; the point is to do both a multiply and an add within the loop - so far, the compiler hasn't figured out to factor out 'a'):
double sum(double a,double* X,int size)
{
double total = 0.0;
for(int i = 0; i < size; ++i)
{
total += a*X[i];
}
return total;
}
#define KB 1024
int main()
{
//Approximately half the L1 cache size of my machine
int operand_size = (32*KB)/(sizeof(double)*2);
printf("Operand size: %d\n", operand_size);
double* X = new double[operand_size];
fill(X,operand_size);
double seconds = timer();
double result;
int n_iterations = 100000;
for(int i = 0; i < n_iterations; ++i)
{
result = sum(3.5,X,operand_size);
//result += rand();
}
seconds = timer() - seconds;
double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
return 0;
}
Note that the timer() and fill() routines are not included for brevity; their full source can be found here if you want to run the code:
Now, here is where it gets interesting. This is the output:
Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8
This is totally un-cached performance, despite the fact that all elements of X should be held in cache between loop iterations. Looking at the assembly code generated by:
g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp
I notice one oddity in the sum function loop:
L55:
movsd (%r12,%rax,8), %xmm0
mulsd %xmm1, %xmm0
addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)
incq %rax
cmpq $2048, %rax
jne L55
The instructions:
addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)
indicate that it is storing the value of "total" in sum() on the stack, and reading and writing it at every loop iteration. I modified the assembly so that this operand is kept in a a register:
...
addsd %xmm0, %xmm3
...
This small change creates a huge performance boost:
Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8
tl;dr My question is: why does replacing a single memory location access with a register, speed up the code so much, given that the single location should be stored in L1 cache? What architectural factors make this possible? It seems very strange that writing one stack location repeatedly would completely destroy the effectiveness of a cache.
Appendix
My gcc version is:
Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)
My CPU is:
Intel Xeon X5650