ARM Link register - non-leaf subroutine

Question

I am wondering where the link register is used in an ARM CPU. As I understand it, it stores the return address of a function. But does every return address go into this register after a function call, or is it only relevant to leaf subroutines? How does it work in functions that have to use the stack (for storing data or additional return addresses): is LR still used there in any way?

See also: https://stackoverflow.com/questions/34091898/bl-instruction-arm-how-does-it-work?rq=1 , and https://stackoverflow.com/questions/60927163/does-the-arm-calling-convention-allow-a-function-to-not-store-lr-to-the-stack?rq=1 — Erik Eidt, Jan 06 '22 at 23:46

Erik Eidt · Answer 1 · 2022-01-07T02:16:20.973

The return address is a parameter, which is hidden from C and other high level languages, but visible in assembly & machine code.

On ARM, called functions (callees) rely on the return address being passed in the lr register. This is by specification of the calling convention, so callers using the standard ARM calling convention must put the return address there, in lr, to satisfy this parameter-passing requirement and expectation.

Standard calling conventions are designed so that a caller can properly invoke a callee — knowing nothing more about the callee than its function signature. Thus, a caller is abstracted from knowing the implementation details of the callee, beyond the parameters and return value(s). This means that a function can evolve (for bug fixes, or other) without having to revisit (or recompile) callers, as long as the signature is unmodified.

Whether a function is a leaf function (or not) is an aspect of its internal implementation and not visible in function signatures. So, a caller does not know (and should not have to know) if the callee is leaf or not, or if a function changes from leaf to non-leaf during some bug fix versioning.

A function's use of the stack is also an internal implementation detail not captured in the function signature, so it does not affect how functions are called or where the return address is passed and expected.

So, there really is only one way to pass the return address.

Callees that need to use lr (because they call other functions, or maybe just want to use that register) must preserve the return address provided to them as a parameter by their callers, so they can use it later to return (assuming they want to return).

Function implementations that use lr (and so preserve its value for later) don't have to restore the preserved return address back into lr; the calling convention does not require the return address to be passed back to callers. So sometimes lr is restored and then used to return, but at other times on ARM the return address is popped directly off the stack into the program counter, bypassing lr entirely.

You could create your own calling convention that passes the return address in a different location, e.g. in a different register, or by pushing it onto the stack!

Some languages do diverge from the standard calling conventions (in minor ways) and then still support the standard calling conventions for their interoperability with C-style functions.

The hardware is designed to support collecting the return address into the lr register while making the call that transfers control to the callee, all in one instruction, so it would be silly to avoid that. The hardware also offers no other particularly efficient way to capture the return address, and there is no real reason for it to.

score 0 · Accepted Answer · edited Jan 09 '22 at 12:39

BL instruction

Operation
  if ConditionPassed(cond) then
      LR = address of the instruction after the branch instruction
      PC = PC + (SignExtend(signed_immed_24) << 2)

Usage
  The BL instruction is used to perform a subroutine call. The return
  from subroutine is achieved by copying the LR to the PC. Typically, 
  this is done by one of the following methods:
  - Executing a BX R14 instruction.
  - Executing a MOV PC,R14 instruction.

And newer ARMs go on to allow for pop {lr} and other...

Seems quite clear to me what the usage of LR is.

You can easily try it yourself as well:

unsigned int more_fun ( unsigned int );
unsigned int fun0 ( unsigned int x )
{
    return(x+1);
}
unsigned int fun1 ( unsigned int x )
{
    return(more_fun(x)+1);
}
unsigned int fun2 ( unsigned int x )
{
    return(more_fun(x));
}
unsigned int fun3 ( unsigned int x )
{
    return(3);
}

00000000 <fun0>:
   0:   e2800001    add r0, r0, #1
   4:   e12fff1e    bx  lr

00000008 <fun1>:
   8:   e92d4010    push    {r4, lr}
   c:   ebfffffe    bl  0 <more_fun>
  10:   e8bd4010    pop {r4, lr}
  14:   e2800001    add r0, r0, #1
  18:   e12fff1e    bx  lr

0000001c <fun2>:
  1c:   e92d4010    push    {r4, lr}
  20:   ebfffffe    bl  0 <more_fun>
  24:   e8bd4010    pop {r4, lr}
  28:   e12fff1e    bx  lr

0000002c <fun3>:
  2c:   e3a00003    mov r0, #3
  30:   e12fff1e    bx  lr

fun1 and fun2 push lr because, as documented, bl modifies the link register. To return from a non-leaf function you need to preserve the link register value (the return address) across that call, so you push it on the stack. The convention for this compiler wants the stack 64-bit aligned, so r4 is pushed simply to maintain that alignment; r4 is otherwise not involved here.

You can see that the leaf functions do not use the stack because they have no reason to: the link register is not modified during the function, and in this case the functions are too simple to need the stack for other reasons. A leaf function that does need the stack does not have to put lr on it, but if the compiler needs another register for alignment reasons, it is free to use r14 as well as any of the other registers.

Now if we force something onto the stack (non-leaf):

unsigned int new_fun ( unsigned int, unsigned int );
unsigned int fun4 ( unsigned int x, unsigned int y)
{
    return(new_fun(x,y)+y);
}

00000034 <fun4>:
  34:   e92d4010    push    {r4, lr}
  38:   e1a04001    mov r4, r1
  3c:   ebfffffe    bl  0 <new_fun>
  40:   e0800004    add r0, r0, r4
  44:   e8bd4010    pop {r4, lr}
  48:   e12fff1e    bx  lr

lr has to be saved on the stack because a bl is used to call the next function. In this case, per the convention, the compiler chose r4 to hold the y variable (which arrives in r1) so that it can be used after the nested call returns. Since only two registers need to be preserved, which fits the stack alignment rule, r4 and lr are saved, and in this case both are used (r4 is not there just to align the stack).

Not sure what you mean by additional return addresses. Perhaps you are thinking that as each function makes a call there is a return address on the stack to preserve, and that is true, but you really only need to look at it one function at a time; that is the beauty of calling conventions. On this architecture, which ideally uses bl to make function calls (as pointed out in another answer, it doesn't have to, but it would be silly not to), lr is modified by every call to a subroutine. As a result the calling function loses its own return address to its caller, so it needs to preserve it locally somehow. As we saw with fun4, a value can be kept in a callee-saved register across the call, so technically the compiler could, for example, do this:

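fun2:
 push {r4, r5}      @ save callee-saved registers (two keeps alignment)
 mov r5, lr         @ keep the return address in r5 across the call
 bl more_fun        @ overwrites lr
 mov r1, r5         @ r1 is a scratch register, r0 holds the result
 pop {r4, r5}       @ restore the caller's r4 and r5
 bx r1              @ return to fun2's caller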

and not actually save lr on the stack. For newer ARMs than the one I am building for, you will see this:

00000008 <fun1>:
   8:   e92d4010    push    {r4, lr}
   c:   ebfffffe    bl  0 <more_fun>
  10:   e2800001    add r0, r0, #1
  14:   e8bd8010    pop {r4, pc}

00000018 <fun2>:
  18:   eafffffe    b   0 <more_fun>

The contents of lr are on the stack (lr itself is of course a register, so it can't be "on the stack"), but after armv4t you can pop to the pc and change modes between arm and thumb (where before only bx could be used for thumb interwork).

Also note the tail-call optimization for fun2: it branches to more_fun with a plain b, so fun2 did not even push the return address on the stack.

It seems pretty obvious how lr is used if you look at the ARM docs. Then think about how a compiler would implement a standard function, and what optimizations it might do. And of course you can just try it and see what particular compilers actually generate.
