x86-64 strange use of stack for local variables

Question

I'm learning x86-64 and I'm working with some compiler-generated assembly code, which I mostly understand. It's a recursive factorial program that calls itself until a base case is reached, at which point 1 is placed in rax; that value is then multiplied by each previously decremented count value as the calls return. I understand alignment in the context of variable access, where there is a massive cost to accessing unaligned data, and I suppose the text segment being aligned is much the same.

In the program, there are two marked points I find confusing. The first makes use of one of the three stack-allocated local variable slots when decrementing the rdi register, which holds the user-provided number to calculate the factorial of. Why not just use rax directly, replacing:

mov qword [rbp + - 16]

with

mov rdi, rax?

The second is the use of the other two stack local variables in performing each factorial multiplication and subsequently doing what seems to be a redundant operation where the result of the multiplication is moved into a local variable from rax and then back into rax before the function returns.

mov qword [rbp + -24], rax                                                                                                                             
mov rax, rdi                                                                                                                                                      
imul rax, qword [rbp + -24]                                                                                                                                   
mov qword [rbp + -8], rax                                                                                                     
mov rax, qword [rbp + -8]

Would these calculations not be much faster utilizing any of the untouched general purpose registers and omitting these stack locals or are these operations a part of the 16-byte alignment?

rec:
  push rbp                                                                                                                                                                      
  mov rbp, rsp                                                                                                                                                              
  sub rsp, 24                                                                                                                 
  push rbx                                                                                                                                                                           
  push r12
  push r13
  push r14
  push r15
.sec0:
  mov qword [rbp + -8], 1                                                                                                                              
  test rdi, rdi                                                                                                                               
  je .sec1                                                                                                                                                          
.sec2:
  mov rax, rdi                                                                                                                                                                  
  sub rax, 1                                                                                                                                                              
  mov qword [rbp + -16], rax  ;; point 1.0                                                                                                                                               
  push rcx                                                                                                                                                                       
  push rdx
  push rsi
  push rdi
  push r8
  push r9
  push r10
  push r11
  mov rdi, qword [rbp + -16]  ;; point 1.1                                                                                                                  
  call rec                                                                                                                                                           
  pop r11
  pop r10
  pop r9
  pop r8
  pop rdi
  pop rsi
  pop rdx
  pop rcx
  mov qword [rbp + -24], rax   ;; point 2.0                                                                                                                           
  mov rax, rdi                                                                                                                                                    
  imul rax, qword [rbp + -24]  ;; point 2.1                                                                                                                                   
  mov qword [rbp + -8], rax    ;; point 2.2
  mov rax, qword [rbp + -8]    ;; point 2.3                                                                                   
  pop r15
  pop r14
  pop r13
  pop r12
  pop rbx
  leave
  ret
.sec1:
  mov rax, qword [rbp + -8]
  pop r15
  pop r14
  pop r13
  pop r12
  pop rbx
  leave
  ret

What/who generated this code? Is this the output of a professor's compiler? We recently had a question about some unusual looking code for a different question but it made me curious if it may be related: https://stackoverflow.com/questions/55169234/what-would-be-the-benefit-of-moving-a-register-to-itself-in-x86-64 — Michael Petch, Mar 15 '19 at 19:14
Was this created by Microsoft's compiler? I never see assembly *this* stupid with gcc or clang, even when not optimizing. — EOF, Mar 15 '19 at 19:38
@EOF Perhaps you’re being sarcastic, but MSVC’s `.asm` output is formatted nothing like that. Also, the code it generates is quite good in my tests. — Davislor, Mar 15 '19 at 21:46
@Davislor I wasn't being sarcastic, I just haven't seen a compiler fail this badly yet. I was genuinely asking, so thank you for your comment. This would lead me to suspect it's a hobby or educational compiler's output. — EOF, Mar 15 '19 at 21:54
@EOF Could be an old version of GCC, or even hand-written as a simple example. — Davislor, Mar 15 '19 at 22:12
@Davislor It can't be too old, since it's targeting x86-64 (~2003-ish?), and the question starts with a sentence about "compiler generated". — EOF, Mar 15 '19 at 22:33

Davislor · Answer 1 · 2019-03-16T01:19:55.543

You don’t say what source code that example was generated from, or by which compiler, but the compiler must be very crude, maybe even a toy compiler from an undergraduate compilers class. You’re right that the output is extremely sub-optimal. Even the oldest version of gcc I tested, with all optimizations off, doesn’t produce code that bad. Let’s look at what we get when we compile with a few different compilers. A good way to compare is over at godbolt.

I tested the following code:

unsigned long long factorial(const unsigned long long n)
{
  return (n <= 1) ? 1
                  : n*(factorial(n-1));
}

The factorial() function is the simple, one-line recursive implementation you describe. I also wrote factorial_tail(), a tail-recursive version with an accumulator, to make it easier for some compilers to notice that the function is tail-recursive modulo an associative operation, and therefore can be automatically transformed into a tight loop.

Modern compilers, though, are generally pretty smart about this.

With no optimizations other than -fomit-frame-pointer (to suppress setting up and restoring rbp as a frame pointer), this is what gcc 8.2 does:

factorial:
        sub     rsp, 24
        mov     QWORD PTR [rsp+8], rdi
        cmp     QWORD PTR [rsp+8], 1
        jbe     .L2
        mov     rax, QWORD PTR [rsp+8]
        sub     rax, 1
        mov     rdi, rax
        call    factorial
        imul    rax, QWORD PTR [rsp+8]
        jmp     .L4
.L2:
        mov     eax, 1
.L4:
        add     rsp, 24
        ret

You can still see the function save its argument on the stack, at [rsp+8] inside the 24 bytes it reserves below the 8-byte return address, and do some unnecessary copying to and from the stack. The purpose of this is so that, when debugging, the value exists at a distinct memory location and can be watched, inspected and modified.

You ask, “Would these calculations not be much faster utilizing any of the untouched general purpose registers and omitting these stack locals [...]?” Good thinking! Indeed it would! You can’t just save every factor of the factorial in a different register, because there could be billions and billions. But you can automatically refactor the code until you need only constant scratch space.

In production code, you would turn optimizations on. For learning purposes, code optimized for space is easier to understand than code fully-optimized for speed, which often is much longer and more complex. With gcc -std=c11 -g -Os -mavx, we get this instead:

factorial:
        mov     eax, 1
.L3:
        cmp     rdi, 1
        jbe     .L1
        imul    rax, rdi
        dec     rdi
        jmp     .L3
.L1:
        ret

GCC is smart enough to figure out that, because multiplication is associative and has an identity, (4 × (3 × (2 × 1))) = 1 × 4 × 3 × 2 × 1. Therefore, it can keep a running total of the product from left to right (4, then 12, then 24) and eliminate the call entirely. That code is just a tight loop, almost identical to what you would get if you wrote a for loop in a high-level language.
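For comparison, here is a rough sketch of the hand-written loop that the -Os listing above resembles. The name factorial_loop is hypothetical and not part of the code I tested; the comments map each line to the instruction it roughly corresponds to.

```c
/* Hypothetical hand-written equivalent of the -Os output for factorial().
 * Not compiler output; the instruction comments are an approximate mapping. */
unsigned long long factorial_loop(unsigned long long n)
{
  unsigned long long product = 1;   /* mov eax, 1: start from the identity */

  for (; n > 1; --n)                /* cmp rdi, 1 / jbe .L1, then dec rdi */
    product *= n;                   /* imul rax, rdi */

  return product;                   /* the result is already in rax */
}
```

Like the original, this wraps around silently once the product no longer fits in 64 bits, which happens for n greater than 20.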

If you optimized for time instead of space with -O3, GCC would try to vectorize the loop, depending on whether you gave it a flag such as -mavx. The other compilers on maximum optimization unroll the loop but do not use vector instructions.

Clang 7.0.0 produces slightly-faster code one instruction longer with the same flags, as it knows enough to check whether to terminate the loop at the end, not jump back and then check at the start. I would prefer this code slightly to GCC’s.

factorial:                              # @factorial
        mov     eax, 1
        cmp     rdi, 2
        jb      .LBB0_2
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        imul    rax, rdi
        dec     rdi
        cmp     rdi, 1
        ja      .LBB0_1
.LBB0_2:
        ret
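In C terms, the shape Clang chose looks roughly like a guarded do-while loop: one check up front to skip the loop entirely, then the exit test at the bottom of each iteration. This is only a sketch with a hypothetical name (factorial_rotated), not something any of the compilers emitted.

```c
/* Hypothetical C rendering of the loop shape in Clang's output:
 * test once before entering, then test at the bottom of every iteration. */
unsigned long long factorial_rotated(unsigned long long n)
{
  unsigned long long product = 1;   /* mov eax, 1 */

  if (n < 2)                        /* cmp rdi, 2 / jb .LBB0_2 */
    return product;

  do {
    product *= n;                   /* imul rax, rdi */
    --n;                            /* dec rdi */
  } while (n > 1);                  /* cmp rdi, 1 / ja .LBB0_1 */

  return product;
}
```

Each trip through the loop executes one conditional branch instead of a conditional branch plus an unconditional jump back to the top.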

MSVC 19.0 cannot figure out how to apply that transformation to that code, and still generates recursive code with call, but we can give it a hint by refactoring and adding an explicit accumulator parameter:

unsigned long long factorial_tail(const unsigned long long n,
                                  const unsigned long long p)
/* The n parameter is the current value counted down, and the p parameter
 * is the accumulating product.  Call this function with an initial value
 * of p = 1.
 */
{
  return (n <= 1) ? p
                  : factorial_tail( n-1, n*p );
}

This version is explicitly tail-recursive, and every modern compiler knows about tail-call elimination. This compiles with /Ox /arch:avx to:

factorial_tail PROC
        mov     rax, rdx
        cmp     rcx, 1
        jbe     SHORT $LN4@factorial_
        mov     rdx, rcx
        imul    rdx, rax
        dec     rcx
        jmp     factorial_tail
$LN4@factorial_:
        ret     0
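Tail-call elimination amounts to replacing the recursive call with an update of the parameters followed by a jump back to the top of the function. A minimal sketch of the resulting loop, written by hand under the hypothetical name factorial_tail_loop, might look like this:

```c
/* Hypothetical hand-written loop equivalent to factorial_tail().
 * Each iteration performs the parameter update that the tail call
 * factorial_tail(n-1, n*p) would have performed. */
unsigned long long factorial_tail_loop(unsigned long long n,
                                       unsigned long long p)
{
  while (n > 1) {
    p = n * p;   /* new p uses the old n, so update p first */
    n = n - 1;   /* then count n down */
  }
  return p;      /* base case: n <= 1 returns the accumulated product */
}
```

As with factorial_tail(), you would call this with an initial value of p = 1.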

You observe, in your own listing, “what seems to be a redundant operation where the result of the multiplication is moved into a local variable from rax and then back into rax before the function returns.” MSVC’s output here does something similar in every iteration of the loop, copying the running product from rdx to rax. It doesn’t realize that, having already put the running product in rax, it can and should just leave it there.

Intel’s compiler 19.0.1 also cannot tell that it can transform factorial() into a loop, but it can with factorial_tail(). With -std=c11 -g -avT -Os, this produces code better than MSVC and very similar to clang:

factorial_tail:
        cmp       rdi, 1                                        #14.16
        jbe       ..B2.5        # Prob 12%                      #14.16
..B2.3:                         # Preds ..B2.1 ..B2.3
        imul      rsi, rdi                                      #15.44
        dec       rdi                                           #15.39
        cmp       rdi, 1                                        #14.16
        ja        ..B2.3        # Prob 88%                      #14.16
..B2.5:                         # Preds ..B2.3 ..B2.1
        mov       rax, rsi                                      #14.16
        ret

It realizes it should avoid copying values from one register to another and back between iterations of the loop. It instead chooses to keep the running product in its initial location, rsi (the second function parameter), and moves the return value to rax only once, at the end.
