
I need to move 1 byte from an xmm register to memory without using general-purpose registers, and I also can't use SSE4.1. Is it possible?

=(

  • Does this answer your question? [How to move 128-bit xmm directly to memory without using registers?](https://stackoverflow.com/questions/20821869/how-to-move-128-bit-xmm-directly-to-memory-without-using-registers) – Alejandro Jun 28 '21 at 17:06
  • I do not believe there is a way to do that without either using a general purpose register or SSE4.1 `pextrb`. – fuz Jun 28 '21 at 17:06
  • @Alejandro I think you missed the part where OP only wants to move 1 byte, not the whole register. – fuz Jun 28 '21 at 17:07
  • @fuz: You can with an [SSE2 maskmovdqu](https://www.felixcloutier.com/x86/maskmovdqu) masked store, but you don't want to because of the NT semantics and being generally slow, much worse than `movd eax, xmm0` / `mov [mem], al` – Peter Cordes Jun 28 '21 at 18:02
  • @PeterCordes oh, I think that is going to work. I can't use general-purpose registers because it's homework for college. I'm going to try maskmovdqu, thanks – Martín Funes Jun 28 '21 at 18:47
  • Sounds like a bizarre assignment. Are you sure you need to store a single byte, instead of for example doing a wider load and merging, then storing back the merge result? It's not thread-safe (non-atomic RMW of the unmodified bytes), but it's normal if you have multiple bytes to modify. – Peter Cordes Jun 28 '21 at 18:55
  • @PeterCordes yeah, it's a bizarre assignment. I need to modify multiple bytes, but the thing is, I have to write 24-bit pixels to memory, and five pixels are 120 bits, so I can't write the entire image with 128-bit stores because I would get a segfault on the last iteration. It would be really ugly to work out how to do all the modifications while changing the alignment. – Martín Funes Jun 28 '21 at 19:26
  • Normally you'd just do the leftover bytes with scalar stores, or with a vector store that can partially overlap, i.e. a final store that ends at the end of the array, even if that overlaps some earlier stores (sketched just below). As long as the total size is >= 16 bytes, this works. If your modification is idempotent (you can safely process the same byte twice, e.g. `a[i] &= ~0x20` but not `a[i] += 10`), or it's write-only, then it's no problem. – Peter Cordes Jun 28 '21 at 19:31
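
A minimal sketch of the partially-overlapping final store described in the comment above, assuming at least 16 bytes to process and an idempotent operation, here a[i] &= ~0x20 as in the comment; the constant and register choices are illustrative:

;; apply a[i] &= ~0x20 to RCX bytes starting at RDI, assuming RCX >= 16
;; the final 16-byte store ends exactly at the end of the buffer and may
;; re-process bytes already handled, which is fine for an idempotent operation
section .rodata
align 16
clear_bit5: times 16 db 0xDF       ; ~0x20 in every byte

section .text
    movdqa  xmm7, [rel clear_bit5]
    lea     rdx, [rdi + rcx - 16]  ; start of the last 16-byte window
chunk_loop:
    movdqu  xmm0, [rdi]
    pand    xmm0, xmm7
    movdqu  [rdi], xmm0
    add     rdi, 16
    cmp     rdi, rdx
    jb      chunk_loop
    movdqu  xmm0, [rdx]            ; final, possibly-overlapping window
    pand    xmm0, xmm7
    movdqu  [rdx], xmm0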

1 Answer


Normally you'd want to avoid this in the first place. For example, instead of doing separate byte stores, can you do one wider load and merge (pand/pandn/por if you don't have pblendvb), then store back the merge result?

That's not thread-safe (non-atomic RMW of the unmodified bytes), but as long as you know the bytes you're RMWing don't extend past the end of the array or struct, and no other threads are doing the same thing to other elements in the same array/struct, it's the normal way to do stuff like upper-case every lower-case letter in a string while leaving other bytes unmodified.
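
A minimal SSE2 sketch of that load/merge/store idea, assuming the replacement byte is in byte 0 of xmm0 and that all 16 bytes at [rdi] are safe to rewrite (register choices are illustrative):

;; replace byte 0 of the 16 bytes at [rdi] with byte 0 of xmm0 (SSE2 only)
;; not thread-safe: non-atomic RMW of the other 15 bytes
    pcmpeqb xmm2, xmm2           ; all-ones
    psrldq  xmm2, 15             ; leave 0xFF in byte 0 only: selects the new byte
    movdqu  xmm1, [rdi]          ; load the surrounding 16 bytes
    pand    xmm0, xmm2           ; keep only byte 0 of the new data
    pandn   xmm2, xmm1           ; keep the other 15 bytes from memory
    por     xmm0, xmm2           ; merge
    movdqu  [rdi], xmm0          ; store the merged 16 bytes back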


Single-uop stores are only possible from vector registers in 4, 8, 16, 32, or 64-byte sizes, except with AVX-512BW masked stores with only 1 element unmasked. Narrower stores like pextrb involve a shuffle uop to extract the 1 or 2 bytes to be stored, plus the store itself.

The only good way to truly store exactly 1 byte without GP integer regs is with SSE4.1 pextrb [mem], xmm0, 0..15. That's still a shuffle uop + a store even with an immediate 0 on current CPUs. If you can safely write 2 bytes at the destination location, pextrw [mem], xmm0, 0..7 is usable, but note that the memory-destination form of pextrw is also SSE4.1; the SSE2 form can only extract to a GP register.
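
For reference, those extract-to-memory forms look like this (rdi and the element indices are just illustrative):

    pextrb  [rdi], xmm0, 3       ; SSE4.1: store byte 3 of xmm0, writes exactly 1 byte
    pextrw  [rdi], xmm0, 1       ; SSE4.1: store word 1 of xmm0, writes 2 bytes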

You could use an SSE2 maskmovdqu byte-masked store (with a 0xff,0,0,... mask), but you don't want to because it's much slower than movd eax, xmm0 / mov [mem], al. e.g. on Skylake, 10 uops, 1 per 6 cycle throughput.

And it's worse than that if you want to reload the byte after, because (unlike AVX / AVX-512 masked stores), maskmovdqu has NT semantics like movntps (bypass cache, or evict the cache line if previously hot).
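
If you did use it anyway, a minimal sketch (maskmovdqu's destination address is implicitly rdi, and only bit 7 of each mask byte matters):

;; store only byte 0 of xmm0 to [rdi]; the destination register is implicit
    pcmpeqb     xmm1, xmm1       ; all-ones
    psrldq      xmm1, 15         ; 0xFF in byte 0 only (bit 7 set = store that byte)
    maskmovdqu  xmm0, xmm1       ; NT byte-masked store: writes 1 byte at [rdi]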


If your requirement is fully artificial and you just want to play silly computer tricks (avoiding ever having your data in registers), you could also set up scratch space e.g. on the stack and use movsb to copy it:

;; with destination address already in RDI
    lea  rsi, [rsp-4]          ; scratch space in the red zone below RSP on non-Windows
    movd  [rsi], xmm0
    movsb                   ; copy a byte, [rdi] <- [rsi], incrementing RSI and RDI

This is obviously slower than the normal way, and it needs an extra register (RSI) for the tmp buffer address. And you need the exact destination address in RDI; you can't use [rel foo] static storage or any other flexible addressing mode.

pop can also copy mem-to-mem, but is only available with 16-bit and 64-bit operand-size, so it can't save you from needing RSI and RDI.
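
For illustration, that mem-to-mem copy looks like this (8 bytes at a time, so it's already too wide for a single-byte store):

    push  qword [rsi]            ; read 8 bytes from [rsi] onto the stack
    pop   qword [rdi]            ; pop them straight into [rdi]: still 8 bytes, not 1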

Since the above way needs an extra register, it's worse in pretty much every way than the normal way:

   movd  esi, xmm0            ; pick any register.
   mov   [rdi], sil           ; al..dl would avoid needing a REX prefix for low-8


;; or even use a register where you can read the low and high bytes separately
   movd  eax, xmm0
   mov   [rdi], al            ; no REX prefix needed, more compact than SIL
   mov   [rsi], ah            ; scatter two bytes reasonably efficiently
   shr   eax, 16              ; bring down the next 2 bytes

(Reading AH has an extra cycle of latency on current Intel CPUs, but it's fine for throughput, and we're storing here anyway so latency isn't much of a factor.)

xmm -> GP integer transfers are not slow on most CPUs. (Bulldozer-family is the outlier, but it's still comparable latency to store/reload; Agner Fog said in his microarch guide (https://agner.org/optimize/) he found AMD's optimization-manual suggestion to store/reload was not faster.)

It's hard to imagine a case where movsb could be better, since you already need a free register for that way, and movsb is multiple uops. Possibly if bottlenecked on port 0 uops for movd r32, xmm on current Intel CPUs? (https://uops.info/)

ecm
Peter Cordes