
I have the following code:

#include <iostream>
#include <chrono>

#define ITERATIONS "10000"

int main()
{
    /*
    ======================================
    The first case: the MOV is outside the loop.
    ======================================
    */

    auto t1 = std::chrono::high_resolution_clock::now();

    asm("mov $100, %eax\n"
        "mov $200, %ebx\n"
        "mov $" ITERATIONS ", %ecx\n"
        "lp_test_time1:\n"
        "   add %eax, %ebx\n" // 1
        "   add %eax, %ebx\n" // 2
        "   add %eax, %ebx\n" // 3
        "   add %eax, %ebx\n" // 4
        "   add %eax, %ebx\n" // 5
        "loop lp_test_time1\n");

    auto t2 = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();

    std::cout << time;

    /*
    ======================================
    The second case: the MOV is inside the loop (faster).
    ======================================
    */

    t1 = std::chrono::high_resolution_clock::now();

    asm("mov $100, %eax\n"
        "mov $" ITERATIONS ", %ecx\n"
        "lp_test_time2:\n"
        "   mov $200, %ebx\n"
        "   add %eax, %ebx\n" // 1
        "   add %eax, %ebx\n" // 2
        "   add %eax, %ebx\n" // 3
        "   add %eax, %ebx\n" // 4
        "   add %eax, %ebx\n" // 5
        "loop lp_test_time2\n");

    t2 = std::chrono::high_resolution_clock::now();
    time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    std::cout << '\n' << time << '\n';
}


I compiled it with

gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu

gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp

and its output is (the first case, then the second):

14474
5837

I also compiled it with Clang with the same result.

So, why is the second case faster (almost a 3x speedup)? Is it actually related to some microarchitectural detail? If it matters, I have an AMD CPU: “AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G”.

eanmos
1 Answer


mov $200, %ebx inside the loop breaks the loop-carried dependency chain through ebx, allowing out-of-order execution to overlap the chain of 5 add instructions across multiple iterations.

Without it, the chain of add instructions bottlenecks the loop on add latency (1 cycle per add along the critical path) instead of on add throughput (4 per cycle on Excavator, improved from 2 per cycle on Steamroller). Your CPU is an Excavator core.
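
(Not from the original answer, just an illustrative sketch of the same latency-vs-throughput point: another way to expose that ILP, without rewriting ebx every iteration, is to spread the adds over several independent accumulator registers so no single register carries a 5-add chain. The extra registers, the label, and the clobber list here are my choices, not the question's.)

    // Sketch: five independent 1-add chains instead of one 5-add chain,
    // so the adds can issue in parallel up to the CPU's add throughput.
    // Extended asm (note the %% register syntax) with clobbers declared.
    asm volatile(
        "mov $100, %%eax\n"
        "mov $200, %%ebx\n"
        "mov $200, %%edx\n"
        "mov $200, %%esi\n"
        "mov $200, %%edi\n"
        "mov $200, %%r8d\n"
        "mov $" ITERATIONS ", %%ecx\n"
        "lp_test_time3:\n"
        "   add %%eax, %%ebx\n"
        "   add %%eax, %%edx\n"
        "   add %%eax, %%esi\n"
        "   add %%eax, %%edi\n"
        "   add %%eax, %%r8d\n"
        "loop lp_test_time3\n"
        : : : "eax", "ebx", "ecx", "edx", "esi", "edi", "r8");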

AMD CPUs since Bulldozer have an efficient loop instruction (only 1 uop), unlike Intel CPUs, where loop is slow and would bottleneck either of these loops at 1 iteration per 7 cycles. (See https://agner.org/optimize/ for instruction tables, the microarch guide, and more details on everything in this answer.)

With loop and mov taking slots in the front-end (and back-end execution units) away from add, a 3x instead of 4x speedup looks about right.
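
Rough back-of-envelope, using only the figures above (1-cycle add latency, 4-per-cycle throughput, loop = 1 uop) and assuming nothing else limits either loop:

    first loop:  5 dependent adds                  -> ~5 cycles per iteration (latency-bound)
    second loop: 7 instructions (5 adds + mov + loop) at 4 per cycle -> ~1.75 cycles per iteration
    expected speedup: 5 / 1.75 ≈ 2.9x, in the same ballpark as the measured 14474 / 5837 ≈ 2.5x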

See this answer for an intro to how CPUs find and exploit Instruction Level Parallelism (ILP).

See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for some in-depth details about overlapping independent dep chains.


BTW, 10k iterations is not many. Your CPU might not even ramp up out of idle clock speed in that time, or it might jump to max speed for most of the 2nd loop but none of the first. So be careful with microbenchmarks like this.
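
One cheap mitigation (a sketch of common practice, not something from the original post; the helper name best_of and the run count are made up): time each region several times and keep the fastest run, so the early repetitions absorb the clock-speed ramp-up.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Time fn() 'runs' times and return the fastest run in nanoseconds;
// the slower early runs soak up the ramp from idle clock speed.
template <typename F>
std::int64_t best_of(int runs, F fn)
{
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < runs; ++i) {
        auto t1 = std::chrono::high_resolution_clock::now();
        fn();
        auto t2 = std::chrono::high_resolution_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        best = std::min<std::int64_t>(best, ns);
    }
    return best;
}

Usage would look like best_of(100, []{ /* one of the asm blocks */ });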

Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX. You step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did that.
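
For reference, a minimal sketch of what the first block could look like with the clobbers declared (extended-asm syntax, so each single % becomes %%, plus volatile so the block isn't dropped once optimization is enabled); this keeps the question's structure rather than being a recommended benchmarking setup:

    asm volatile(
        "mov $100, %%eax\n"
        "mov $200, %%ebx\n"
        "mov $" ITERATIONS ", %%ecx\n"
        "lp_test_time1:\n"
        "   add %%eax, %%ebx\n"
        "   add %%eax, %%ebx\n"
        "   add %%eax, %%ebx\n"
        "   add %%eax, %%ebx\n"
        "   add %%eax, %%ebx\n"
        "loop lp_test_time1\n"
        :                       /* no outputs */
        :                       /* no inputs  */
        : "eax", "ebx", "ecx"); /* registers the asm writes */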

Peter Cordes