I have the following code:
#include <iostream>
#include <chrono>
#define ITERATIONS "10000"
int main()
{
/*
======================================
The first case: the MOV is outside the loop.
======================================
*/
auto t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $200, %ebx\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time1:\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time1\n");
auto t2 = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << time;
/*
======================================
The second case: the MOV is inside the loop (faster).
======================================
*/
t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time2:\n"
" mov $200, %ebx\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time2\n");
t2 = std::chrono::high_resolution_clock::now();
time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << '\n' << time << '\n';
}
The first case
I compiled it with
gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
and its output is
14474
5837
I also compiled it with Clang with the same result.
So, why the second case is faster (almost 3x speedup)? Does it actually related with some microarchitectural details? If it matters, I have an AMD's CPU: “AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G”.