Copy a byte to another register in GNU C inline asm, where the compiler chooses registers for both operands

Question

I'm trying to mess around with strings in inline asm for c. I was able to understand how strcpy works (shown below):

static inline char *strcpy(char *dest, char *src)
{
  int d0, d1, d2;
  char temp;

  asm volatile(
    "loop:  lodsb;"   /* load value pointed to by %si into %al, increment %si */
    "       stosb;"   /* move %al to address pointed to by %di, increment %di */
    "       testb %%al, %%al;"
    "       jne loop;"
    : "=&S" (d0), "=&D" (d1), "=&a" (d2)
    : "0" (src), "1" (dest)
    : "memory"
  );
}

I'm trying to use this structure to make it so I can modify individual characters of the string before returning them. As a result I'm attempting something that looks like:

static inline char *strcpy(char *dest, char *src)
{
  int d0, d1, d2;
  char temp;

  asm volatile(
    "loop:  lodsb;"   /* load value pointed to by %si into %al, increment %si */
    "       mov %2, %3;" /* move al into temp */
    /*
     *
     * Do and comparisons and jumps based off how I want to change the characters
     *
     */
    "       stosb;"   /* move %al to address pointed to by %di, increment %di */
    "       testb %%al, %%al;"
    "       jne loop;"
    : "=&S" (d0), "=&D" (d1), "=&a" (d2), "+r" (temp)
    : "0" (src), "1" (dest)
    : "memory"
  );
}

Where I'm basically moving the byte put into %al by the lodsb instruction into a temp variable where I do any processing after. However, it seems that the character is never actually stored in temp for some reason I cannot figure out.

BTW, your first version of `strcpy` looks safe and correct, nice. You could simplify with `"+S"(src)` because you're inside a wrapper function (so it's ok to modify this function's local `src`) but dummy output with matching constraint is the canonical way to take an input in a register you want to destroy. Of course it's not particularly *efficient* to use lodsb/stosb in a loop, especially on CPUs that don't rename AL separately from RAX so every load also needs an ALU uop to merge into RAX. But byte-at-a-time is much worse than you can do with SSE2 so let's ignore performance. — Peter Cordes, Feb 18 '20 at 05:08

score 2 · Answer 1 · answered Feb 18 '20 at 05:31

Your 2nd version won't even assemble because temp and d2 are different sizes, so you end up with mov %eax, %dl from GCC: https://godbolt.org/z/tng4g4. When inline asm doesn't do what you want, always look at the compiler-generated asm to see what the compiler actually substituted into your template (and what registers it picked for which operand).

This doesn't match what you describe (runs but doesn't work) so it's not a MCVE of exactly what you were doing. But the real question is still answerable.

One easy way is to declare both C temporaries the same size so GCC picks the same width registers.

Or you can use size overrides like mov %k2, %k3 to get movl or mov %b2, %b3 to get movb (8-bit operand-size).

Strangely you chose int for the "=a" temporary so the compiler picks EAX, even though you only load a char.

I'd actually recommend movzbl %2b, %3k to use the opposite sizes from how you declared the variables; that's more efficient than merging a byte into the low byte of the destination, and avoids introducing (or adding more) partial-register problems on P6-family, early Sandybridge-family, and CPUs that don't do any partial-register renaming. Plus, Intel since Ivybridge can do mov-elimination on it.

BTW, your first version of strcpy looks safe and correct, nice. And yes, the "memory" clobber is necessary.

Err, at least the inline asm is correct. You have C undefined behaviour from falling off the end of a non-void function without a return statement, though.

You could simplify the asm operands with a "+&S"(src) read/write operand instead of a dummy output because you're inside a wrapper function (so it's ok to modify this function's local src). Dummy output with matching constraint is the canonical way to take an input in a register you want to destroy, though.

(If you want to work like ISO C's poorly-designed strcpy, you'd want char *retval = dst ahead of the asm statement, if you're going to use the above suggestion of "+S" and "+D" operands. A better idea would be to call it stpcpy and return a pointer to the end of the destination. Also, your src should be const char*.)

Of course it's not particularly efficient to use lodsb/stosb in a loop, especially on CPUs that don't rename AL separately from RAX so every load also needs an ALU uop to merge into RAX. But byte-at-a-time is much worse than you can do with SSE2 so optimizing this with movzx loads, and maybe an indexed addressing mode is probably not worth the trouble. See https://agner.org/optimize/ and other optimization links in https://stackoverflow.com/tags/x86/info, especially https://uops.info/ for instruction latency / throughput / uop count. (stosb is 3 uops vs. 2 total for a mov store + inc edi.)

If you're actually optimizing for code-size over speed, just use 8-bit or 32-bit mov to copy registers, not movzbl.

BTW, with this many operands, you probably want to use named operands like [src] "+&S"(src) in the constraints, and then %[src] in the template.

Hey, so I tried using size overrides to adjust my registers so they all have the same value with `mov %b2, %3;` and I also changed the output type of %eax to be a char instead of an int with `"=&a" (ret)` and it seemed to have properly modified the register size (instruction in compiler-generated asm looks like `mov %al, %cl`). So by my understand of your explanation, there shouldn't be an issue with the data and register size. However, I am still unable to execute a compare instruction against this register with an arbitrary hexadecimal value (`cmp 0x70, %3`). — Dave, Feb 18 '20 at 06:20
@Tom: `0x70` is a memory operand with absolute address `0x70`. You want something like `cmp $0x70, %%al`. (No need to copy before branching, you might only need to read it; only bother with that in the `if` body if you need a 2nd copy. You know lodsb wrote AL, and you know you used a dummy output to tell the compiler about EAX specifically, so you might as well write normal-looking asm that uses that register name explicitly instead of an operand number. BTW, a clobber on `%eax` would work, too, if you don't care about reading the value outside the asm statement.) — Peter Cordes, Feb 18 '20 at 07:02

Copy a byte to another register in GNU C inline asm, where the compiler chooses registers for both operands

1 Answers1