C++: Mysteriously huge speedup from keeping one operand in a register

Question

I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array using the following code (I am aware that I should just scale the result by 'a' at the end; the point is to do both a multiply and an add within the loop - so far, the compiler hasn't figured out to factor out 'a'):

double sum(double a,double* X,int size)
{
    double total = 0.0;
    for(int i = 0;  i < size; ++i)
    {
        total += a*X[i];
    }
    return total;
}

#define KB 1024
int main()
{
    //Approximately half the L1 cache size of my machine
    int operand_size = (32*KB)/(sizeof(double)*2);
    printf("Operand size: %d\n", operand_size);
    double* X = new double[operand_size];
    fill(X,operand_size);

    double seconds = timer();
    double result;
    int n_iterations = 100000;
    for(int i = 0; i < n_iterations; ++i)
    {
        result = sum(3.5,X,operand_size);
        //result += rand();  
    }
    seconds = timer() - seconds; 

    double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
    return 0;
}

Note that the timer() and fill() routines are not included for brevity; their full source can be found here if you want to run the code:

http://codepad.org/agPWItZS

Now, here is where it gets interesting. This is the output:

Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8

This is totally un-cached performance, despite the fact that all elements of X should be held in cache between loop iterations. Looking at the assembly code generated by:

g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp

I notice one oddity in the sum function loop:

L55:
    movsd   (%r12,%rax,8), %xmm0
    mulsd   %xmm1, %xmm0
    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)
    incq    %rax
    cmpq    $2048, %rax
    jne L55

The instructions:

    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)

indicate that it is storing the value of "total" in sum() on the stack, and reading and writing it at every loop iteration. I modified the assembly so that this operand is kept in a register:

...
addsd   %xmm0, %xmm3
...

This small change creates a huge performance boost:

Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8

tl;dr My question is: why does replacing a single memory location access with a register speed up the code so much, given that the single location should be stored in L1 cache? What architectural factors make this possible? It seems very strange that writing one stack location repeatedly would completely destroy the effectiveness of a cache.

Appendix

My gcc version is:

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

My CPU is:

Intel Xeon X5650

And the compiler was dumb enough to not keep it in register? I'm impressed... — Mysticial, Mar 27 '13 at 17:33
@Mysticial: In fairness to gcc folk, `gcc 4.7.2` does keep it in a register. — NPE, Mar 27 '13 at 17:35
@NPE Ah ok. That helps me keep *some* faith in the compiler... — Mysticial, Mar 27 '13 at 17:36
@leemes: It's just default Apple one on this box, which I can't change cause it breaks some other software :/ I recognize a better version of gcc will probably be faster; my question is more about WHY that version is faster. — Sam Manzer, Mar 27 '13 at 17:46
Even for an oldish compiler such as gcc 4.2 this is surprisingly bad. I think the preferred compiler on Mac nowadays is clang, which is probably more up to date. Also, to always have the best optimization, use `-O3 -march=native` if your version of gcc supports that already. — Jens Gustedt, Mar 27 '13 at 18:16

score 63 · Accepted Answer · edited May 23 '17 at 12:02

It's likely a combination of a longer dependency chain, along with Load Misprediction*.

Longer Dependency Chain:

First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)

In the unoptimized version, the critical dependency path is:

addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)

Internally, it probably breaks up into:

load (2 cycles)
addsd (3 cycles)
store (3 cycles)

If we look at the optimized version, it's just:

addsd (3 cycles)

So you have 8 cycles vs. 3 cycles. Almost a factor of 3.
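As a rough sanity check on that estimate, the sketch below redoes the arithmetic and sets it next to the ratio of the two mflops figures reported in the question. The latencies are the ones quoted in this answer. The struct and function names are made up for illustration, and the model deliberately ignores everything except the loop-carried critical path.

```cpp
#include <cassert>

// Latencies in cycles, as quoted in the answer; treat them as estimates.
struct ChainLatency {
    int load;
    int add;
    int store;
};

// Critical path per iteration when `total` lives on the stack:
// the load, the add and the store are all part of the loop-carried chain.
constexpr int memory_chain_cycles(ChainLatency l) {
    return l.load + l.add + l.store;
}

// Critical path per iteration when `total` stays in a register:
// only the add is loop-carried.
constexpr int register_chain_cycles(ChainLatency l) {
    return l.add;
}

// Speedup predicted by this simple latency model.
constexpr double predicted_speedup(ChainLatency l) {
    return static_cast<double>(memory_chain_cycles(l)) / register_chain_cycles(l);
}

// Ratio of the two mflops figures reported in the question.
constexpr double measured_speedup() {
    return 1958.9 / 588.8;
}
```

With the quoted latencies the model predicts a ratio of about 2.7, while the question's timings give about 3.3, so pure latency counting appears to account for most of the gap but not all of it.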

I'm not sure how sensitive the Nehalem processor line is to store-load dependencies and how well it does forwarding. But it's reasonable to believe that it's not zero.

Load-store Misprediction:

Modern processors use prediction in more ways than you can imagine. The most famous of these is probably Branch Prediction. One of the lesser known ones is Load Prediction.

When a processor sees a load, it will immediately load it before all pending writes finish. It will assume that those writes will not conflict with the loaded values.

If an earlier write turns out to conflict with a load, then the load must be re-executed and the computation rolled back to the point of the load, in much the same way that branch mispredictions roll back.

How it is relevant here:

Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it finishes the store (movsd %xmm0, -72(%rbp)) from the previous iteration.

The result? The previous store conflicts with the load - thus a misprediction and a roll back.

*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.
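One way to try this out with a compiler that does keep `total` in a register is to force the accumulator through memory on purpose. The sketch below uses `volatile` for that, which is a swapped-in technique and not what gcc 4.2 did; the function names are hypothetical. It is not guaranteed to produce the same instruction sequence as in the question, so it is worth checking the generated assembly before drawing conclusions from any timings.

```cpp
#include <cassert>
#include <cstddef>

// Accumulator declared volatile: the compiler must load and store `total`
// on every iteration, so the loop-carried dependency goes through memory.
double sum_through_memory(double a, const double* X, std::size_t size) {
    assert(X != nullptr || size == 0);
    volatile double total = 0.0;
    for (std::size_t i = 0; i < size; ++i) {
        total = total + a * X[i];
    }
    return total;
}

// Plain local accumulator: an optimizing compiler is free to keep it
// in a register for the whole loop.
double sum_in_register(double a, const double* X, std::size_t size) {
    assert(X != nullptr || size == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < size; ++i) {
        total += a * X[i];
    }
    return total;
}
```

Both functions add the terms in the same order, so they should return the same value; only the time per iteration is expected to differ, and by how much will depend on the CPU.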

answered Mar 27 '13 at 17:45

Mysticial

http://stackoverflow.com/q/10274355/56778 has some useful links that lead me to the same conclusion. – Jim Mischel Mar 27 '13 at 17:46
But there is also the loop, so the factor of approx. 3 is for adding to the result variable only. If you count the jump and loop condition too, the factor is approx. 2 I guess, but the timings show a factor of approx. 3.5... So I guess the number of cycles isn't the most important factor, but the dependency. – leemes Mar 27 '13 at 17:47
Note that the load could potentially contend for the same resources as `movsd (%r12,%rax,8), %xmm0`. So that could produce additional stalls. But it's hard to be sure without a cycle-accurate emulator. – Mysticial Mar 27 '13 at 17:49
It can pipeline some of this though? Overlap the add with subsequent loads? – Sam Manzer Mar 27 '13 at 17:52
@SamManzer The only load it can overlap with is `movsd (%r12,%rax,8), %xmm0`. It cannot overlap with the other load because it is dependent on it. Likewise, you can't load the next iteration early because it depends on the value you store to it in the current iteration. – Mysticial Mar 27 '13 at 17:53
@SamManzer I've updated my answer to include another (very likely) possibility that could be in effect. – Mysticial Mar 27 '13 at 18:02
Can you see load mis-predictions with hardware performance counters? – Sam Manzer Mar 27 '13 at 18:12
@SamManzer I've never tried. So I wouldn't know. These kinds of "performance faults" are rarely encountered in properly written code so I've never really had to measure or deal with them. – Mysticial Mar 27 '13 at 18:16
+1 for getting at what I found in assembly and providing a better explanation of why it makes a difference here – UpAndAdam Mar 27 '13 at 21:04
What is called "load misprediction" here is usually called something like "store forwarding speculation", "store aliasing speculation" or "memory dependence speculation". The CPU speculatively performs a later load before all earlier store addresses are known. The situation isn't as bad as you might imagine on modern CPUs: if the speculation turns out to be wrong for the same load a few times, the CPU remembers it and stops speculating the load ahead of the earlier stores. See [here](http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation) for a deeper treatment. – BeeOnRope Jan 01 '18 at 07:35

UpAndAdam · Answer 2 · 2013-04-05T23:04:04.853

16

I would surmise the problem isn't in the cache/memory access but in the processor (execution of your code). There are several visible bottlenecks here.

Performance numbers here were based on the boxes I was using (either sandybridge or westmere)

Peak performance for scalar math is 2.7 GHz x 2 FLOPS/clock = 5.4 GFLOPS, since the processor can do an add and a multiply simultaneously. Theoretical efficiency of the code is 0.6/(2.7*2) = 11%

Bandwidth needed: one double (8 bytes) per (+) and (x) pair -> 4 bytes/FLOP; 4 bytes * 5.4 GFLOPS = 21.6 GB/s

If you know it was read recently, it's likely in L1 (89 GB/s), L2 (42 GB/s) or L3 (24 GB/s), so we can rule out cache B/W

The memory subsystem is 18.9 GB/s, so even in main memory, peak performance should approach 18.9/21.6 GB/s = 87.5%
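The back-of-the-envelope numbers above can be written out as a small calculation. This is only a sketch: every constant is copied from this answer and was measured on the answerer's hardware, and the identifier names are invented for illustration.

```cpp
#include <cassert>

// Figures taken from the answer; they will differ on other machines.
constexpr double clock_ghz       = 2.7;   // core clock
constexpr double flops_per_clock = 2.0;   // one add plus one multiply per cycle
constexpr double measured_gflops = 0.6;   // value used in the answer's efficiency estimate
constexpr double memory_gb_per_s = 18.9;  // main-memory bandwidth
constexpr double bytes_per_flop  = 4.0;   // one 8-byte double per 2 flops

// Peak scalar throughput in GFLOPS.
constexpr double peak_gflops() { return clock_ghz * flops_per_clock; }

// Fraction of peak that the measured code reaches.
constexpr double efficiency() { return measured_gflops / peak_gflops(); }

// Bandwidth needed to feed the core at peak throughput, in GB/s.
constexpr double needed_gb_per_s() { return bytes_per_flop * peak_gflops(); }

// Fraction of peak reachable if every operand came from main memory.
constexpr double memory_bound_fraction() { return memory_gb_per_s / needed_gb_per_s(); }
```

Plugging in other clock or bandwidth figures gives the corresponding estimate for a different machine.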

May want to batch up requests (via unrolling) as early as possible

Even with speculative execution, for tot += a * X[i] the adds will be serialized, because tot(n) needs to be evaluated before tot(n+1) can be kicked off

First, unroll the loop
Step i by 8 and do:

{//your func
    for( int i = 0; i < size; i += 8 ){
        tot += a * X[i];
        tot += a * X[i+1];
        ...
        tot += a * X[i+7];
    }
    return tot;
}

Use multiple accumulators
This will break dependencies and allow us to avoid stalling on the addition pipeline

{//your func//
    double tot,tot2,tot3,tot4;   // accumulators must be double, not int
    tot = tot2 = tot3 = tot4 = 0;
    for( int i = 0; i < size; i += 8 ){
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    return tot + tot2 + tot3 + tot4;
}
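The multiple-accumulator idea above can also be written as a complete function. The version below is a minimal sketch rather than the exact code that was benchmarked: the name `sum_multi_acc` is hypothetical, and the remainder loop is an addition so that sizes that are not a multiple of 8 still work.

```cpp
#include <cassert>
#include <cstddef>

// Scaled sum with four independent accumulators and 8x unrolling.
// The four chains do not depend on each other, so their additions
// can overlap in the floating-point add pipeline.
double sum_multi_acc(double a, const double* X, std::size_t size) {
    assert(X != nullptr || size == 0);
    double tot = 0.0, tot2 = 0.0, tot3 = 0.0, tot4 = 0.0;
    std::size_t i = 0;
    // Main loop: processes 8 elements per iteration.
    for (; i + 8 <= size; i += 8) {
        tot  += a * X[i];
        tot2 += a * X[i + 1];
        tot3 += a * X[i + 2];
        tot4 += a * X[i + 3];
        tot  += a * X[i + 4];
        tot2 += a * X[i + 5];
        tot3 += a * X[i + 6];
        tot4 += a * X[i + 7];
    }
    // Remainder loop: handles the last size % 8 elements.
    for (; i < size; ++i) {
        tot += a * X[i];
    }
    return (tot + tot2) + (tot3 + tot4);
}
```

Because the terms are added in a different order than in the simple loop, the result can differ in the last few bits for general floating-point input.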

UPDATE After running this on a SandyBridge box I have access to (2.7 GHz SandyBridge with -O2 -march=native -mtune=native):

Original code:

Operand size: 2048  
Vector size 2048: mflops=2206.2, result=61.8  
2.206 / 5.4 = 40.8%

Improved Code:

Operand size: 2048  
Vector size 2048: mflops=5313.7, result=61.8  
5.3137 / 5.4 = 98.4%

edited Apr 05 '13 at 23:04

answered Mar 27 '13 at 17:56

UpAndAdam

Just curious, where do you get your cache/memory bandwidth data? – Sam Manzer Mar 27 '13 at 18:04
It was measured on our hardware, which I only realized afterwards, so YMMV :-) For Nehalem I found this: http://stackoverflow.com/questions/2353299/cache-bandwidth-per-tick-for-modern-cpus – UpAndAdam Mar 27 '13 at 19:12
This is interesting; so it is only achieving 40% of peak while in cache. Pretty neat; I will look at these different optimizations. – Sam Manzer Mar 27 '13 at 20:17
However, isn't it true that in the register and stack cases that I presented, the addition pipeline is fully serialized? So this would not explain the RELATIVE speedup between those two versions. – Sam Manzer Mar 27 '13 at 20:27
@SamManzer The addition pipeline can never be fully utilized in either case. The latency is 3 with throughput 1. So you need to run 3 of them at a time to max it out. But since there's only a single addition and it's on the critical path, the most you can get is 1/3 utilization of the addition pipeline. – Mysticial Mar 27 '13 at 20:31
@Sam Manzer You can find a great lesson on this on the open course from Stanford in performance computing on itunes U. one of the early cases is dot product / matrix multiplication which I derived some of this from. – UpAndAdam Mar 27 '13 at 21:20
Great point @mystical! That said, I'm not sure how any of this is really comparing L1 to memory, or whether or not our flops measure is actually accurate. – UpAndAdam Mar 27 '13 at 21:20
@SamManzer if you use memtest86 it will tell you the bandwidth to each level. On a Core i3 box I built for my girlfriend, for example, the bandwidths are L1: 109752 MB/s, L2: 49887 MB/s, L3: 37415 MB/s and main memory: 17894 MB/s – UpAndAdam Apr 16 '13 at 04:50

NPE · Answer 3 · 2013-03-27T17:56:21.987

8

I can't actually reproduce this because my compiler (gcc 4.7.2) keeps total in a register.

I suspect the main reason for the slowness doesn't have to do with the L1 cache, but rather is due to the data dependency between the store in

movsd   %xmm0, -72(%rbp)

and the load on the subsequent iteration:

addsd   -72(%rbp), %xmm0

edited Mar 27 '13 at 17:56

answered Mar 27 '13 at 17:43

NPE

Ah, so you mean that it has to wait for the store to complete before the load can start? I guess if the cache is write-through that would get very expensive. – Sam Manzer Mar 27 '13 at 17:46
This is a difference between gcc 4.2 and later versions; gcc 4.4 onwards optimizes the temporary stack usage out. – FrankH. Mar 27 '13 at 18:35
