
I wrote a simple test program (main.c):

void test(){
}
void main(){
    test();
}

Then I compiled it with arm-none-eabi-gcc and used objdump to get the assembly code:

arm-none-eabi-gcc -g -fno-defer-pop -fomit-frame-pointer -c main.c
arm-none-eabi-objdump -S main.o > output

The generated code for main() pushes the r3 and lr registers, even though test() does nothing.

main.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <test>:
void test(){
}
   0:   e12fff1e        bx      lr

00000004 <main>:
void main(){
   4:   e92d4008        push    {r3, lr}
        test();
   8:   ebfffffe        bl      0 <test>
}
   c:   e8bd4008        pop     {r3, lr}
  10:   e12fff1e        bx      lr

My question is: why does ARM GCC choose to push r3 onto the stack, even though test() never uses it? Does GCC just pick a random register to push? If it's for the stack alignment requirement (8 bytes on ARM), why not just subtract from sp? Thanks.

==================Update==========================

@KemyLand Regarding your answer, I have another example. The source code is:

void test1(){
}
void test(int i){
        test1();
}
void main(){
        test(1);
}

I used the same compile command as above and got the following assembly:

main.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <test1>:
void test1(){
}
   0:   e12fff1e        bx      lr

00000004 <test>:
void test(int i){
   4:   e52de004        push    {lr}            ; (str lr, [sp, #-4]!)
   8:   e24dd00c        sub     sp, sp, #12
   c:   e58d0004        str     r0, [sp, #4]
        test1();
  10:   ebfffffe        bl      0 <test1>
}
  14:   e28dd00c        add     sp, sp, #12
  18:   e49de004        pop     {lr}            ; (ldr lr, [sp], #4)
  1c:   e12fff1e        bx      lr

00000020 <main>:
void main(){
  20:   e92d4008        push    {r3, lr}
        test(1);
  24:   e3a00001        mov     r0, #1
  28:   ebfffffe        bl      4 <test>
}
  2c:   e8bd4008        pop     {r3, lr}
  30:   e12fff1e        bx      lr

If push {r3, lr} in the first example is there to use fewer instructions, why didn't the compiler use a single instruction in test() as well?

push {r0, lr}

It uses 3 instructions instead of 1:

push {lr}
sub sp, sp, #12
str r0, [sp, #4]

By the way, why does it subtract 12 from sp? The stack is 8-byte aligned, so subtracting 4 should be enough, right?

Alan
  • To ensure stack alignment; this has been asked and answered here a number of times. – old_timer Sep 17 '15 at 19:41
  • @auselen: I don't think it is a duplicate question. I am fine with the stack pointer being moved down 4 bytes; my question is why r3 is pushed. r3 is just a temporary register, so why not push r2, r1 or some other register? The link you provided above didn't answer my question. – Alan Sep 17 '15 at 20:35
  • So your question is why r3, instead of any other register? You understand that you need to push lr in a non-leaf function, and that you need to keep the stack at 8-byte alignment? – auselen Sep 18 '15 at 12:06
  • Your second example is different question entirely - preserving callee-saved registers (the initial push), is a separate thing from creating a local stack frame (then stashing a copy of the argument in it). – Notlikethat Sep 19 '15 at 15:32
  • @auselen Yes, I understand pushing lr, but I don't understand why r3 is pushed. To push r3 you need to access the r3 register and then subtract from sp, right? Why not use "push {lr}; sub sp, sp, #4" instead? Why should the CPU waste time accessing an unused register? – Alan Sep 19 '15 at 20:29
  • @Notlikethat, I don't think r3 is a callee-saved register. r3 is a scratch register; a subroutine does not need to preserve the value in r3. – Alan Sep 19 '15 at 20:34
  • @alan Yes, but take it in the context of the answer provided. That first instruction of the prologue is used to preserve callee-saved registers _as necessary_, in this case only `lr`. That's not to say it can't _also_ push a junk register to maintain alignment across the leaf call in the case where it _doesn't_ need to otherwise touch SP immediately afterwards to create space for a stack frame. On anything reasonably modern, stuffing an extra word into the write buffer is likely to be a lot quicker than the pipeline stall from dependent back-to-back modifications of the same register. – Notlikethat Sep 19 '15 at 23:36
  • That said, reasoning about the assembly code generated by a compiler with optimisation turned off is generally a pretty futile exercise, because the _only_ requirement of the generated code is that it produces the correct result. My favourite GCC -O0 idiom is "save a value to the stack between statements, immediately reload it into the exact same register, then never use it again"; you really have to go to at least -O1 if you want things to make any kind of sense. – Notlikethat Sep 19 '15 at 23:51
  • push {r3, lr}, which is equivalent to "stmdb sp!, {r3, lr}", is a single instruction. It saves lr and keeps the stack aligned by adding r3 to the list. The instruction can store a list of registers to an address and also update the register that holds the address. – auselen Sep 20 '15 at 00:39

1 Answer

According to the Standard ARM Embedded ABI, r0 through r3 are used to pass arguments to a function and (starting with r0) to hold its return value. They are also caller-saved scratch registers, so a function may overwrite them freely. Meanwhile, lr (a.k.a. r14) is the link register, whose purpose is to hold the return address for a function.

lr must be saved because the bl to test() overwrites it; otherwise main() would have no way to return to its caller.

It's worth noting that every ARM-mode instruction takes 32 bits and, as you mentioned, ARM has a call stack alignment requirement of 8 bytes. As a bonus, we're using the Embedded ARM ABI, where code size matters. Thus, it's more efficient to have a single 32-bit instruction that both saves lr and aligns the stack by pushing an unused register (r3 holds nothing of value here, because test() neither takes arguments nor returns anything), and then to pop both in a single 32-bit instruction, rather than adding more instructions (and thus wasting precious memory!) to manipulate the stack pointer.
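The alignment arithmetic can be checked with a small host-side sketch. It assumes the A32 multi-register push encoding, where the low 16 bits of the instruction word are the register list (one bit per register, 4 bytes stored per set bit). The helper names push_bytes, keeps_alignment and check_examples are made up for this illustration, and 0xe92d4000 is the multi-register form with only lr in the list, used here only for comparison (the disassembly above shows the assembler picking the str form for a lone lr).

```c
#include <assert.h>
#include <stdint.h>

/* The procedure call standard wants sp 8-byte aligned at a call. */
#define STACK_ALIGN 8u

/* Hypothetical helper: bytes stored by an A32 multi-register push.
   Each bit set in the low 16 bits of the encoding names one
   register, and each register occupies one 4-byte word. */
static unsigned push_bytes(uint32_t insn)
{
    unsigned count = 0;
    for (uint32_t list = insn & 0xFFFFu; list != 0; list >>= 1)
        count += list & 1u;
    return count * 4u;
}

/* Hypothetical helper: does a frame of this size keep an aligned
   sp aligned? */
static int keeps_alignment(unsigned frame_bytes)
{
    return frame_bytes % STACK_ALIGN == 0;
}

static void check_examples(void)
{
    /* e92d4008 = push {r3, lr}: bits 3 and 14 are set, so 8 bytes. */
    assert(push_bytes(0xe92d4008u) == 8);
    assert(keeps_alignment(push_bytes(0xe92d4008u)));

    /* Pushing lr alone stores 4 bytes and would leave sp misaligned. */
    assert(push_bytes(0xe92d4000u) == 4);
    assert(!keeps_alignment(push_bytes(0xe92d4000u)));

    /* test() in the update: 4 bytes for lr plus "sub sp, sp, #12". */
    assert(keeps_alignment(4 + 12));
}
```

Calling check_examples() from a main() on any host compiler runs the checks. The sketch only confirms the byte counts; it says nothing about why GCC at -O0 reserves 12 bytes rather than 4 in test().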

In short, it's reasonable to conclude that this is just a code-size optimization by GCC.

3442