ARM assembly: auto-increment register on store

Question

Is it possible to auto-increment the base address of a register on a STR with a [Rn]!? I've peered through the documentation but haven't been able to find a definitive answer, mainly because the command syntax is presented for both LDR and STR - in theory it should work for both, but I couldn't find any examples of auto-incrementing on a store (the loading works ok).

I've made a small program which stores two numbers in a vector. When it's done, the contents of out should be {1, 2}, but the second store overwrites the first element, as if the auto-increment isn't working.

#include <stdio.h>

int main()
{
        int out[]={0, 0};
        asm volatile (
        "mov    r0, #1          \n\t"
        "str    r0, [%0]!       \n\t"
        "add    r0, r0, #1      \n\t"
        "str    r0, [%0]        \n\t"
        :: "r"(out)
        : "r0" );
        printf("%d %d\n", out[0], out[1]);
        return 0;
}

EDIT: While the answer was right for regular loads and stores, I found that the optimizer messes up auto-increment on vector instructions such as vldm/vstm. For instance, the following program

#include <stdio.h>

int main()
{
        volatile int *in = new int[16];
        volatile int *out = new int[16];

        for (int i=0;i<16;i++) in[i] = i;

        asm volatile (
        "vldm   %0!, {d0-d3}            \n\t"
        "vldm   %0,  {d4-d7}            \n\t"
        "vstm   %1!, {d0-d3}            \n\t"
        "vstm   %1,  {d4-d7}            \n\t"
        :: "r"(in), "r"(out)
        : "memory" );

        for (int i=0;i<16;i++) printf("%d\n", out[i]);
        return 0;
}

compiled with

g++ -O2 -march=armv7-a -mfpu=neon main.cpp -o main

will produce gibberish on the output of the last 8 variables, because the optimizer is keeping the incremented variable and using it for the printf. In other words, out[i] is actually out[i+8], so the first 8 printed values are the last 8 from the vector and the rest are memory locations out of bounds.

I've tried with different combinations of the volatile keyword throughout the code, but the behavior changes only if I compile with the -O0 flag or if I use a volatile vector instead of a pointer and new, like

volatile int out[16];

Slide 44 of http://simplemachines.it/doc/arm_inst.pdf covers use of the ! operator. I think it can, but I have no way to test right now. – L7ColWinters Feb 01 '12 at 19:52

score 6 · Accepted Answer · old_timer · 2012-02-02T16:15:57.420

For store and load you do this:

ldr r0,[r1],#4
str r0,[r2],#4

Whatever you put at the end, 4 in this case, is added to the base register (r1 in the ldr example, r2 in the str example) after the register has been used for the address but before the instruction has completed. It is very much like

unsigned int a,*b,*c;
...
a = *b++;
*c++ = a;
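
The store forms differ only in when the offset is applied and whether the base register is updated. Here is a minimal C model of that, assuming word-sized stores. The helper names are hypothetical and each function returns what the base register would hold afterwards.

```c
/* Hypothetical helpers: each models one ARM store addressing form.
   The instruction offset is in bytes, so #4 is one 32-bit word. */

/* str r0, [r2], #4    post-indexed: store at r2, then r2 += 4 */
unsigned int *store_post_indexed(unsigned int *base, unsigned int value)
{
    *base = value;
    return base + 1;
}

/* str r0, [r2, #4]!   pre-indexed with writeback: r2 += 4, then store at r2 */
unsigned int *store_pre_indexed(unsigned int *base, unsigned int value)
{
    base += 1;
    *base = value;
    return base;
}

/* str r0, [r2, #4]    plain offset: store at r2 + 4, r2 is left unchanged */
unsigned int *store_offset(unsigned int *base, unsigned int value)
{
    base[1] = value;
    return base;
}
```

Under this model, a `[Rn]!` with no offset would be the pre-indexed form with an offset of zero, which may be why the question's `str r0, [%0]!` left the base where it was.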

EDIT: You need to look at the disassembly to see what is going on, if anything. I am using the latest Code Sourcery (now just Sourcery Lite from Mentor Graphics) toolchain.

arm-none-linux-gnueabi-gcc (Sourcery CodeBench Lite 2011.09-70) 4.6.1

#include <stdio.h>
int main ()
{
        int out[]={0, 0};
        asm volatile (
        "mov    r0, #1          \n\t"
        "str    r0, [%0], #4       \n\t"
        "add    r0, r0, #1      \n\t"
        "str    r0, [%0]        \n\t"
        :: "r"(out)
        : "r0" );
        printf("%d %d\n", out[0], out[1]);
        return 0;
}


arm-none-linux-gnueabi-gcc str.c -O2  -o str.elf

arm-none-linux-gnueabi-objdump -D str.elf > str.list


00008380 <main>:
    8380:   e92d4010    push    {r4, lr}
    8384:   e3a04000    mov r4, #0
    8388:   e24dd008    sub sp, sp, #8
    838c:   e58d4000    str r4, [sp]
    8390:   e58d4004    str r4, [sp, #4]
    8394:   e1a0300d    mov r3, sp
    8398:   e3a00001    mov r0, #1
    839c:   e4830004    str r0, [r3], #4
    83a0:   e2800001    add r0, r0, #1
    83a4:   e5830000    str r0, [r3]
    83a8:   e59f0014    ldr r0, [pc, #20]   ; 83c4 <main+0x44>
    83ac:   e1a01004    mov r1, r4
    83b0:   e1a02004    mov r2, r4
    83b4:   ebffffe5    bl  8350 <_init+0x20>
    83b8:   e1a00004    mov r0, r4
    83bc:   e28dd008    add sp, sp, #8
    83c0:   e8bd8010    pop {r4, pc}
    83c4:   0000854c    andeq   r8, r0, ip, asr #10

so the

sub sp, sp, #8

is to allocate the two local ints out[0] and out[1]

mov r4,#0
str r4,[sp]
str r4,[sp,#4]

is because they are initialized to zero, then comes the inline assembly

8398:   e3a00001    mov r0, #1
839c:   e4830004    str r0, [r3], #4
83a0:   e2800001    add r0, r0, #1
83a4:   e5830000    str r0, [r3]

and then the printf:

83a8:   e59f0014    ldr r0, [pc, #20]   ; 83c4 <main+0x44>
83ac:   e1a01004    mov r1, r4
83b0:   e1a02004    mov r2, r4
83b4:   ebffffe5    bl  8350 <_init+0x20>

And now it is clear why it didn't work: you didn't declare out as volatile. You gave the compiler no reason to go back to RAM to get the values of out[0] and out[1] for the printf. The compiler knows that r4 contains the value for both out[0] and out[1], and there is so little code in this function that it didn't have to evict r4 and reuse it, so it used r4 for the printf.

If you change it to be volatile

    volatile int out[]={0, 0};

Then you should get the desired result:

83a8:   e59f0014    ldr r0, [pc, #20]   ; 83c4 <main+0x44>
83ac:   e59d1000    ldr r1, [sp]
83b0:   e59d2004    ldr r2, [sp, #4]
83b4:   ebffffe5    bl  8350 <_init+0x20>

the preparation for printf reads from ram.
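
The inline assembly above also changes the pointer held in %0 while declaring it as an input only. The following is a minimal sketch, assuming GCC-style extended asm on a 32-bit ARM target in ARM or Thumb-2 mode. It declares a scratch copy of the pointer as a read/write operand and adds a "memory" clobber, so out itself does not need to be volatile. The name store_one_two is hypothetical, and the non-ARM branch is there only so the sketch builds on other hosts.

```c
/* Hypothetical helper: writes 1 into out[0] and 2 into out[1]. */
void store_one_two(int *out)
{
#if defined(__arm__)
    int *p = out;                       /* scratch copy; the asm advances it */
    __asm__ volatile (
        "mov    r0, #1          \n\t"
        "str    r0, [%0], #4    \n\t"   /* store, then p += 4 bytes */
        "add    r0, r0, #1      \n\t"
        "str    r0, [%0]        \n\t"
        : "+r"(p)                       /* p is both read and written */
        :
        : "r0", "memory");              /* r0 is overwritten; memory is modified */
#else
    /* Portable fallback (assumption: only needed on non-ARM hosts). */
    int *p = out;
    *p++ = 1;
    *p = 2;
#endif
}
```

Because the caller's pointer is copied into p first, out still points at the start of the array after the call.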

edited Feb 02 '12 at 16:15

answered Feb 01 '12 at 20:20


using ! is for ldm/stm instructions, the amount to increment or decrement is determined by the number of registers in the register list (not enough room in the instruction for a constant). the ldr/str has room for a constant. – old_timer Feb 01 '12 at 20:23
for that matter you can use stm, stmia r0!,{r1} will store r1 at address r0 and then "increment after" (ia) the store. stmia increment after, stmdb decrement before, stmda decrement after, stmib increment before. – old_timer Feb 01 '12 at 21:13
While the solution works, I found that the output of the program is {0, 0} for the -O2 compiler flag. Any thoughts as to why this is happening? I am using the values in the printf function, and I don't get any warnings at compile-time. – John S Feb 02 '12 at 11:58
Oh I see, thanks for the explanation. So I have to declare all variables that are used in a volatile environment as volatile themselves? What if I then use them in ordinary c++ code, do you think the code will be slow (in an average use case) due to the compiler not optimizing memory access to them? – John S Feb 02 '12 at 17:10
Ok, I've just figured out that I can add "memory" to the clobber list and get the same results as with the volatile keyword. All is well now :) – John S Feb 02 '12 at 17:35
The problem is that you were using the assumption that declaring variables means they live in memory, the optimizer tries to avoid memory where possible/practical. Then you snuck around behind the compiler and manipulated memory. So either solution, tell the compiler that out is in memory that is changing (volatile) or tell the compiler every variable declared has been modified (clobber). Or does the clobber list allow you to specify what memory was modified so you can target the out array specifically? – old_timer Feb 02 '12 at 17:50
As far as I've seen, a specific memory location such as an array cannot be targeted except if I declare it as volatile. Putting "memory" in the clobber list tells the compiler not to use cached memory locations (any) from the registers. – John S Feb 03 '12 at 09:50
Well, all this worked until I tried to do the same thing with NEON vstr/vstm instructions. I increment the register with ! (since this is the way the reference manual says it works) but the modified register is kept by the compiler when the -O2 flag is set. I tried with "memory" and the volatile keyword but the behavior is the same. Could this be a compiler bug or is there some other thing I missed? – John S Feb 07 '12 at 11:17
I dont know the neon instructions very well. I also never use inline assembler I use raw/straight assembler and link it in. You are probably one step ahead of me on neon. – old_timer Feb 07 '12 at 14:27
Ok, for the last couple of days gcc has driven me crazy, the optimizer jumbles up everything, making the binary unusable and debugging extremely difficult. Can you please tell me an "easier" way of putting optimized code inside my program? Do you know of any tutorials for this? I heard that raw assembly files are more tricky than inline, you have to save/load the stack or something like that. Also, when you profile your code what do you use to see how many cycles does it take? I'm currently using a simple timer, but it gives quite random results. Thanks again for your explanations! – John S Feb 09 '12 at 17:41
what is it that you are really trying to do, what is it that the compiler isnt doing that warrants assembly in the first place? Inline assembly is extremely compiler specific and as you are finding out just learning how to beat one compiler into submission for one task is a challenge. I prefer more portable code that relies as little on the specific compiler as possible. Determine there is a need to do the hand optimization before doing it. It sounds like you are operating on a lot of assumptions that might not be valid. – old_timer Feb 09 '12 at 18:20
to use straight asm you do need to know the calling convention at least a little, but that is not hard to figure out because you can write C code and examine what the compiler produces to see how the convention works. The arm uses register passing for the first few words worth of arguments and a register to return so depending on what you are doing you might not need the stack at all. Yes because it is basically another function you might lose some clocks in the preparation to make the call, there is a similar risk with inlining though. – old_timer Feb 09 '12 at 18:23
http://github.com/dwelch67/stm32f4d I dabbled a little with neon/float instructions. Bear in mind that was on a cortex-m4 (does not support arm instructions, thumb/thumb2 only), and I was simply trying a divide or something like that which is more painful with fixed point. Knowing the bigger task trying to be optimized you can if need be hand code a larger part, or take the compiler generated code and hand modify it to improve it. – old_timer Feb 09 '12 at 18:26
Profiling is like benchmarking, it is easier to get it wrong or misinterpret the results than it is to get it right and isolate the real issues. re-arranging a few lines of C or splitting or joining functions can produce several times performance gains (or losses). once caches get involved you need lots of performance experience to understand the results and what the next step is. If performance is important for the task though you have to learn sometime/somewhere. – old_timer Feb 09 '12 at 18:29
I started with the zen of assembly language by Michael Abrash back when you could actually buy that book on the shelf at the store new. And have been practicing performance optimizations ever since (20 years or so). I just pulled that book out a few weeks ago and looked at it and what I learned then from that book I still use every day, not the 8088/86 details but the thought processes. No matter how good you think you are at hand optimization, you need to time your code, at the same time you need to know how to time the code and interpret the results. – old_timer Feb 09 '12 at 18:32
Wow, thank you so much for the replies, I'll try and keep those in mind. I'm currently working on an OMAP 4430 pandaboard on which I'm running image processing algorithms. Because of the large image sizes involved, it's quite slow, so I'm trying to optimize them. The bad thing is that after spending a week on re-writing code for NEON, the application runs exactly the same (slow) as before, so I guess I'll have to change the algorithm itself to be more suitable for the NEON pipeline or try and find other resources from the board with which to split the workload. – John S Feb 13 '12 at 18:27
From my Zen of assembly roots, and a little common sense. Are you sure you are targeting the proper performance problem? The problem may be the moving of data around and might not be how fast you can compute it. A race car crew might be able to change the tires and gas up in a matter of seconds, but if the car only goes 20 miles an hour max it doesn't matter, you won't notice how fast the pit stop was. Is it possible to pass the data through unprocessed and see if it is the copy/I/O and not the processing? – old_timer Feb 13 '12 at 18:43
if nothing else, by measuring the copy/I/O without processing you get a feel for the max theoretical speed if the processing was infinitely fast. If you can move 10MBps without processing you cant expect to be able to process and move faster than 10MBps without fixing the I/O/copy paths. yes the cache messes with the numbers and every time you touch the code it changes where things are in the cache messing again with the numbers... – old_timer Feb 13 '12 at 18:46
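
On the question raised in the comments above about targeting the out array specifically: GCC's extended asm does accept memory operands, so the written array can be named as an "=m" output instead of using a blanket "memory" clobber. This is a minimal sketch under that assumption. The function name is hypothetical, the cast to a two-element array type is what tells the compiler how much memory is written, and the non-ARM branch exists only so the sketch builds elsewhere.

```c
/* Hypothetical helper: same stores as before, but only out[0..1] is
   declared as modified rather than all of memory. */
void store_one_two_targeted(int *out)
{
#if defined(__arm__)
    int *p = out;                           /* scratch copy; the asm advances it */
    __asm__ volatile (
        "mov    r0, #1          \n\t"
        "str    r0, [%1], #4    \n\t"       /* %1 is p; %0 is not used in the template */
        "add    r0, r0, #1      \n\t"
        "str    r0, [%1]        \n\t"
        : "=m"(*(int (*)[2])out),           /* the two ints at out are written */
          "+r"(p)                           /* p is both read and written */
        :
        : "r0");
#else
    /* Portable fallback (assumption: only needed on non-ARM hosts). */
    out[0] = 1;
    out[1] = 2;
#endif
}
```

The narrower declaration lets the optimizer keep unrelated values cached in registers across the asm statement, which a "memory" clobber would prevent.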

score 0 · Answer 2 · answered Jul 26 '14 at 08:10

GCC inline assembler requires that all modified registers and non-volatile variables are listed as outputs or clobbers. In the second example GCC may and does assume that the registers allocated to in and out do not change.

A correct approach would be:

out_temp = out;
asm volatile ("..." : "+r"(in), "+r"(out_temp) :: "memory" );

score 0 · Answer 3 · answered Jan 02 '17 at 14:03

I found this question while searching for the answer to a similar question: how to bind an input/output register. The GCC documentation of the inline assembler constraints says that the + prefix on an operand designates an input/output (read/write) register.

In the example, it seems to me that you would prefer to preserve the original value of the variable out. Nevertheless, if you want to use the post-increment (!) variant of the instructions, I think that you should declare the parameters as read/write. The following worked on my Raspberry Pi 2:

#include <stdio.h>

int main()
{
  int* in = new int[16];            // 16-element input buffer
  volatile int* out = new int[16];  // 16-element output buffer

  for (int i=0; i<16; i++) in[i]=i;

  asm volatile(
    "vldm %0!, {d0-d3}\n\t"
    "vldm %0, {d4-d7}\n\t"
    "vstm %1!, {d0-d3}\n\t"
    "vstm %1, {d4-d7}\n\t"
    :"+r"(in), "+r"(out) :: "memory");

  // out was advanced by 8 ints inside the asm, so index back from it
  for (int i=0; i<16; i++) printf("%d\n", out[i-8]);
  return 0;
}

In this way, the semantics of the code is clear to the compiler: both the in and out pointers will be changed (incremented by 8 elements).

Disclaimer: I do not know if the ARM ABI allows a function to freely clobber the NEON registers d0 through d7. In this simple example it probably does not matter.
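
Whatever the procedure-call standard says about d0 through d7 across function calls, an inline asm statement is not a call, so GCC only knows a register is overwritten if it appears as an output or in the clobber list. The following is a minimal sketch, assuming GCC on a 32-bit ARM target with VFP/NEON enabled. It works on scratch copies of both pointers and lists d0 to d7 as clobbers. The name copy16 is hypothetical, and the memcpy branch exists only so the sketch builds on other targets.

```c
#include <string.h>

/* Hypothetical helper: copies 16 ints from in to out. */
void copy16(const int *in, int *out)
{
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    const int *src = in;                /* scratch copies; the asm advances them */
    int *dst = out;
    __asm__ volatile (
        "vldm   %0!, {d0-d3}    \n\t"   /* load 8 ints, src += 32 bytes */
        "vldm   %0,  {d4-d7}    \n\t"   /* load the next 8 ints */
        "vstm   %1!, {d0-d3}    \n\t"   /* store 8 ints, dst += 32 bytes */
        "vstm   %1,  {d4-d7}    \n\t"   /* store the next 8 ints */
        : "+r"(src), "+r"(dst)          /* both pointers are read and written */
        :
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "memory");
#else
    /* Portable fallback (assumption: only needed on non-ARM/non-NEON builds). */
    memcpy(out, in, 16 * sizeof(int));
#endif
}
```

Since only the scratch copies are advanced, the caller can keep indexing out from zero afterwards, without the out[i-8] adjustment.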
