Set an XMM register to a repeating byte pattern (broadcast a constant byte)

Question

I know that we can do something like this to load a character into an XMM register:

    movaps xmm1, xword [.__0x20]

    align 16
    .__0x20 db 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20

But since this requires a memory access, I want to know whether there is a better way. (Also, I'm talking about SSE2, not other SIMD extensions.)

I want every byte of the xmm1 register to be 0x20, not just one byte.

(Editor's note: this can be called a broadcast or splat.
It's what the _mm_set1_epi8(0x20) intrinsic does.)

What you are doing is the fastest way of doing it when the desired byte is a constant. — fuz, Mar 29 '20 at 20:54
I'm just looking for a better way (if there is one), some immediate or from-register way! — ELHASKSERVERS, Mar 29 '20 at 20:55
Is your byte a constant or is it variable? If it is a constant, then what you do is already the fastest way. — fuz, Mar 29 '20 at 21:03
In that case, your code is already ideal. Depending on what assembler you use, there might be some sort of `times` or `dup` directive to make it easier to type. You could also define a macro if this annoys you. — fuz, Mar 29 '20 at 21:16

Peter Cordes · Accepted Answer · 2020-11-29T21:22:52.030

With only SSE2, loading the full pattern from memory is generally your best bet.

In your NASM source you can use times 16 db 0x20 for easy maintainability.

With SSE3 you can do 8-byte broadcast loads with movddup. With AVX you can do a 4-byte broadcast-load with vbroadcastss. These broadcast-loads are very good on modern CPUs, running on just the load port, not needing a shuffle uop. i.e. they're exactly as cheap as movaps on CPUs that support them, except for a byte or two more code-size. Same for vbroadcastf128 to YMM registers.

Most compilers don't seem to realize this and will do constant-propagation through _mm_set1 even when that results in a 32 byte constant instead of 4 bytes, even when just mov... loading it ahead of a loop, not folding it into a memory operand for an ALU instruction. (And that's still possible with broadcast-loading when AVX512 is available.) Clang does sometimes take advantage of broadcast loads for simple constants.

AVX2 adds vpbroadcastb/w/d/q, but only dword and qword are pure load uops. Byte and word broadcast-loads need an ALU shuffle uop, so for constant byte patterns you probably want to just broadcast-load a dword that repeats a byte 4 times. (Unless it's an element from a big lookup table, then compress the table by using a byte or word broadcast load, or a pmovsx sign-extending load or whatever).

AVX512 adds vpbroadcastb/w/d/q from an integer register, so you could mov eax, 0x20202020 / vpbroadcastd xmm0, eax if you have AVX512VL.

With SSE2 it would take at least 2 instructions including an ALU shuffle, like this, and may not be worth it.

    movd    xmm0, [const_4B]
    pshufd  xmm0, xmm0, 0
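
For reference, the same two-step broadcast can be sketched with C intrinsics. This is a minimal sketch, not part of the original answer: `splat_dword_sse2` is a hypothetical name, and the input comes from an integer argument rather than a memory constant like const_4B. With a runtime input compilers typically emit movd + pshufd for this; with a compile-time constant they will likely fold it back into a 16-byte constant load.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copy a 4-byte pattern into all four dword lanes. */
static inline __m128i splat_dword_sse2(uint32_t pattern)
{
    __m128i v = _mm_cvtsi32_si128((int)pattern); /* movd:   pattern in lane 0, upper lanes zeroed */
    return _mm_shuffle_epi32(v, 0);              /* pshufd: imm8 = 0 selects lane 0 for every lane */
}
```

Calling `splat_dword_sse2(0x20202020u)` should give a vector with 0x20 in every byte, the same value as `_mm_set1_epi8(0x20)`.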

Some repeating constants can be generated on the fly in a couple instructions, starting with all-ones from pcmpeqd xmm0,xmm0. See "What are the best instruction sequences to generate vector constants on the fly?" and Agner Fog's guide.

This pattern does not appear to be easy to generate. It's a byte pattern (not word, dword, or qword), and the finest granularity SSE shifts offer is word. However, if we know the bits shifted across byte boundaries are 0, that's fine, e.g.

    pcmpeqd  xmm0, xmm0     ; set1( -1 )
    pabsb    xmm0, xmm0     ; set1_epi8(1)    SSSE3
    pslld    xmm0, 5        ; set1_epi8(1<<5)

    ; or with only SSE2, something even less efficient like shift / packsswb / shift
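
One possible reading of that SSE2-only shift / packsswb / shift idea, written as C intrinsics so the data flow is easy to check. This is a sketch and not from the original answer: the function name is made up, and an optimizing compiler will likely constant-fold the whole thing into a load, so treat it as documentation of the instruction sequence rather than code to ship.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build set1_epi8(0x20) with ALU instructions only. */
static inline __m128i splat_0x20_sse2_alu(void)
{
    __m128i z     = _mm_setzero_si128();
    __m128i ones  = _mm_cmpeq_epi32(z, z);          /* pcmpeqd:  all bits set             */
    __m128i words = _mm_srli_epi16(ones, 15);       /* psrlw 15: every word = 0x0001      */
    __m128i bytes = _mm_packs_epi16(words, words);  /* packsswb: every byte = 0x01        */
    return _mm_slli_epi32(bytes, 5);                /* pslld 5:  0x01010101 -> 0x20202020 */
}
```

The final dword shift is safe for the same reason as in the SSSE3 version: only bit 0 of each byte is set, so shifting left by 5 never carries a bit across a byte boundary.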

This is unlikely to be worth it unless you really want to avoid the possibility of a cache miss for the constant. A load will usually come out ahead.

Are you aware of any answers to this question for GP 64 bit register? — Noah, Mar 13 '21 at 01:06
@Noah: For a constant, normally just `mov rdi, 0x0101010101010101` or whatever. For a non-constant, `imul rcx, rdi` with that 0x01 repeating constant, after zero-extending the byte into RCX. So a worst-case cost of `mov reg,imm64` for the multipliers, `movzx ecx, byte source`, and `imul r64,r64`. — Peter Cordes, Mar 13 '21 at 01:23
@Noah: yeah, fast hardware multipliers can be "abused" to do many neat things, including summing small-enough elements into the high byte. ([How to count the number of set bits in a 32-bit integer?](https://stackoverflow.com/a/109025)). A multiply is a shift-and-add operation, with each add enabled or not by a bit of the other value. Also, @phuclv explains the mechanics nicely in [How to create a byte out of 8 bool values (and vice versa)?](https://stackoverflow.com/a/51750902) — Peter Cordes, Mar 13 '21 at 02:22
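
The multiply trick from these comments can be sketched in portable C. This is a minimal sketch with a hypothetical function name; for a runtime byte, a compiler targeting x86-64 would be expected to produce something like the movzx / mov reg,imm64 / imul sequence described above, though the exact code is up to the compiler.

```c
#include <stdint.h>

/* Hypothetical helper: broadcast one byte to all 8 bytes of a 64-bit integer. */
static inline uint64_t splat_byte_u64(uint8_t b)
{
    /* Each set bit of the multiplier adds a copy of b shifted by 0, 8, ..., 56 bits.
       The copies land in separate bytes, so no carries occur between them. */
    return (uint64_t)b * UINT64_C(0x0101010101010101);
}
```

For example, `splat_byte_u64(0x20)` yields 0x2020202020202020.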
