
Are there any non-obvious tricks to fill an AVX512 register with incrementing bytes (little-endian)? That is, the equivalent of this code:

#include <immintrin.h>
#include <stdalign.h>   /* for alignas in C11 */

__m512i make_incrementing_bytes(void) {
    /* Compiler optimizes this into an initialized array in .rodata. */
    alignas(64) char data[sizeof(__m512i)];
    for (unsigned i = 0; i < sizeof(data); i++) {
        data[i] = i;
    }
    return _mm512_load_si512(data);
}

The only obvious approach I see (and the one that GCC produces with the above code) is to just take the generic approach of using a vmovdqa64 from memory - but this constant is low-entropy enough that it seems like one ought to be able to do better, somehow.

(I know that constant loads typically aren't on the critical path, or that you have a spare register to dedicate to the constant so you can reload it cheaply, but I'm interested in whether there are any tricks buried in this instruction set. For an instruction set with a full-width register multiply, for instance, you can fill every byte with 0x01, square the register, and left-shift the result by one byte - but that isn't suited to AVX512 so far as I can tell.)
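Spelled out in 64-bit scalar form, that trick would look something like this (a sketch for illustration; mul widens into rdx:rax):

mov  rax, 0x0101010101010101   ; every byte = 0x01
mul  rax                       ; rdx:rax = 128-bit square; low-half bytes are 01 02 .. 08
shl  rax, 8                    ; rax = 0x0706050403020100, bytes 0..7 in little-endian order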

  • I am somewhat glossing over _which_ AVX512 extensions here, fairly deliberately. If an answer is "hey, there's this neat thing in this obscure extension that isn't available yet", I've still learned something :-) – TLW Nov 27 '22 at 17:11
  • I'm not aware of any pattern/sequence trick in asm, unfortunately. It's easy to get the *same* thing in every element, as in [What are the best instruction sequences to generate vector constants on the fly?](https://stackoverflow.com/q/35085059), but nothing naturally does a different thing to each element. At best I could imagine it might be possible to do some kind of shuffle/add step to build up a widening pattern with log2 64 steps, but I'm not sure what the steps might look like, and that's a lot of ops vs. a load. – Peter Cordes Nov 27 '22 at 17:31
  • Normally you'd just `_mm512_loadu_si512(data)` instead of over-complicating things with `memcpy`. Deref of a `__m512i*` [is strict-aliasing safe](https://stackoverflow.com/questions/52112605/is-reinterpret-casting-between-hardware-simd-vector-pointer-and-the-correspond), the actual reason you couldn't do it safely is that you didn't use `alignas(64)` on the array. I'd recommend changing that in the question to keep it focused on the question about generating vector constants. Loop + load vs. just loading a 64-byte constant is a size vs. speed tradeoff, if the loop can be smaller than 64B – Peter Cordes Nov 27 '22 at 17:38
  • @PeterCordes - thank you. I'm just getting into AVX512, and haven't figured everything out yet, as you've noted. I should also make an explicit comment in the code that I'm expecting that entire loop to be constant-propagated out, leaving just the load. I'll do so now. – TLW Nov 27 '22 at 17:41
  • Interesting trick with `0x01010101` squared, yeah that could be a good starting point, except you might as well start directly with `mov eax, 0x01020304` / `vpbroadcastd xmm0, eax` (or ZMM) or `vmovd xmm0, eax`, or even with `mov rax, 0x0102030405060708` / `vpbroadcastq zmm0, rax` which are cheaper than `vpternlogd zmm0,zmm0,zmm0, -1` / `vpabsb zmm0, zmm0` / `vpmullq zmm0,zmm0,zmm0` (Or a 64-bit constant). There's no widening 64-bit => 128-bit multiply; AVX-512 does have `vpmullq` which AVX2 doesn't, but it's 2 uops on Intel CPUs. (One on Zen4). – Peter Cordes Nov 27 '22 at 17:41
  • @PeterCordes - look carefully. It actually generates 12345654321 (well, bytes, and much larger, but I'm not typing that out :-)). You can generate either endianness depending on if you take the high or low half of the multiplication. – TLW Nov 27 '22 at 17:49
  • BTW, is this actually an intrinsics question? I notice the code in your question is in C or C++. Even if you come up with a clever way to generate a constant in a few instructions, some compilers (gcc and clang) will constprop through it and make a 64-byte constant whether you want them to or not. Making your own choices of how to generate constants in registers is usually only something you get to choose in asm. – Peter Cordes Nov 27 '22 at 17:49
  • Oh, you're right, I was looking at the high half of `0x01010101 ** 2` to get `0x1020304030201`. The low half is in the right order for little-endian machines, the high half is reversed. (And it's not even the high half, just the high 56 bits.) – Peter Cordes Nov 27 '22 at 17:51
  • @PeterCordes - it's an assembly question, I just don't know a way to write the constant to load in assembly that is both concise and clear. (`val: .rept 64 / .byte .-val / .endr` satisfies the former but not the latter, for instance.) Feel free to rewrite in asm if you think it suits it better. – TLW Nov 27 '22 at 22:22
  • Updated my answer to address that last point of how to write the constant in your source without explicitly listing every value. I think with a comment, your idea is really not bad. A NASM version of that looks easier to read than one using `%assign`, especially with a comment to remind readers of what it does. It's of course easy to test and hard to break during maintenance, as it's not affected by other code. (Except for alignment; even though 64B cache-line-split loads are allowed, they take more throughput resources.) – Peter Cordes Nov 27 '22 at 22:46
  • (Yeah, in practice you'd want a `.align 64` or equivalent, of course. I just overestimated how confusing the asm version would be to readers here.) – TLW Nov 27 '22 at 23:32

1 Answer


I don't think there's any very efficient way to generate a sequence like that on the fly where different elements have different values. 64 different byte values is pretty high entropy if you can't take advantage of the similarity to previous elements.

It's only easy to broadcast 4-byte or 8-byte patterns (from mov-immediate to an integer register), or 4, 8, 16, or 32-byte patterns from memory. Or, with vpmovzxbd for example, to "compress" the storage of shuffle constants that have wider elements (word, dword or qword), at the cost of an extra shuffle uop when you load them. Or to generate something on the fly where every element has the same value, starting from a vector of all-ones bytes. But unless you're writing asm by hand, compilers will constant-propagate through intrinsics, so you're at their mercy. Some of them are smart enough to use broadcast loads instead of expanding _mm512_set1_epi32(0x03020100) into 64 bytes, but not always.
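For example, the easy cases look like this (a sketch; the memory labels are placeholders for constants you'd define in .rodata):

mov             eax, 0x03020100
vpbroadcastd    zmm0, eax               ; repeat a 4-byte pattern from an immediate
vbroadcasti64x2 zmm1, [xmm_const]       ; repeat a 16-byte pattern from memory
vpmovzxbd       zmm2, [bytes16]         ; 16 packed bytes -> 16 dwords: compressed shuffle constant
vpternlogd      zmm3, zmm3, zmm3, 0xFF  ; all-ones without a load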

There aren't instructions which do something different to each element, and the multiply trick is limited to working within 64-bit chunks.

Interesting trick with 0x01010101 squared; that could be a good starting point, except you might as well start directly with mov eax, 0x03020100 / vpbroadcastd xmm0, eax (or ZMM) or vmovd xmm0, eax, or 64-bit mov rax, 0x0706050403020100 (10 bytes) / vpbroadcastq zmm0, rax (6 bytes), which are cheaper than vpternlogd zmm0,zmm0,zmm0, -1 / vpabsb zmm0, zmm0 (to get set1_epi8(1)) plus vpmullq zmm0,zmm0,zmm0 / vpsllq zmm0, zmm0, 8.
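For reference, that cheap 8-byte starting point by itself is just (a sketch; it only gives the repeating pattern, not the full 0..63 sequence):

mov          rax, 0x0706050403020100   ; 10 bytes: 0..7 in little-endian byte order
vpbroadcastq zmm0, rax                 ; 6 bytes: that pattern in all 8 qwords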

There's not even a widening 64-bit => 128-bit multiply, although AVX-512 does have vpmullq, which AVX2 doesn't. However, it's 2 uops on Intel CPUs (one on Zen 4).

Each AVX-512 instruction is at least 6 bytes (4-byte EVEX + opcode + modrm), so that adds up quickly if you're optimizing for pure size of .text+.rodata (which might not be unreasonable outside a loop). You still wouldn't want an actual loop that stored 4 bytes at a time for 16 iterations, like add eax, 0x04040404 / stosd; that would be slower than you want even outside a loop.


Starting with set1_epi32(0x03020100) or a 64-bit or 128-bit version would still need multiple shuffle and add steps to widen up to 512-bit, with the right amount of 0x04, 0x08, or 0x10 added to each part of the broadcast result.

I can't think of anything better, and it's still not good enough to use. Using some AVX2 instructions saves code size vs. ZMM all the way, unless I'm missing a way to save an instruction.

The strategy is to create [ 0x30 repeating | 0x20 repeating | 0x10 repeating | 0x00 repeating] in a ZMM and add it to a broadcast 16-byte pattern.

default rel
  vpbroadcastd     ymm1, [vec4_0x10]   ; we're loading another constant anyway, this is cheaper
  vpaddd           ymm2, ymm1,ymm1     ; set1(0x20)
  vmovdqa          xmm3, xmm1          ; [ set1(0)   , set1(0x10) ]     ; mov-elimination
  vpaddd           ymm4, ymm3, ymm2    ; [ set1(0x20), set1(0x30) ]
  vshufi32x4       zmm4, zmm3, zmm4, 0b00_01_00_01    ; _MM_SHUFFLE(0,1,0,1) works like shufps but in 16-byte chunks.
  vbroadcasti64x2  zmm0, [vec16_0to15]
  vpaddb           zmm0, zmm0, zmm4     ; memory-source broadcast only available with element size, e.g. vpaddq z,z,m64{1to8} but that'd take more granular shuffling

section .rodata
align 16
  vec16_0to15: db 0,1,2,3,4,5,6,7
              db 8,9,10,11,12,13,14,15

  vec4_0x10: dd 0x10101010

Size: 0x2c bytes of machine code, plus 16 + 4 = 0x14 bytes of constants.
Total: 0x40 = 64 bytes, the same as putting the whole literal constant in memory.

Masking might have saved vector instructions, at the cost of needing to set up mask-register values, which costs mov eax, imm32 / kmov k1, eax.
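As an untested sketch of that masking idea (the mask value and instruction choices are my guesses, reusing the vec_0to31 / mask_0x20 constants defined below): broadcast the 32-byte 0..31 pattern into both halves, then merge-mask an OR that sets bit 5 only in the upper half. That's two vector ops plus two scalar mask-setup instructions.

mov     eax, 0xFF00                          ; select the high 8 of 16 dword elements
kmovw   k1, eax
vbroadcasti64x4 zmm0, [vec_0to31]            ; 0..31 repeated in both 256-bit halves
vpord   zmm0{k1}, zmm0, [mask_0x20]{1to16}   ; OR 0x20 into the upper half only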

A less-bad tradeoff of instruction (uop) count vs. total size could be to start with a 32-byte 0..31 constant so you just need to set 1 bit in the upper half after broadcasting.

;; update: this is a better tradeoff, 61 total bytes and far fewer instructions
;; 25 bytes of machine code in 3 instructions
default rel
   vmovdqa ymm0, [vec_0to31]                ;  0..31
   vpord   ymm1, ymm0, [mask_0x20]{1to8}    ; 32..63
   vinserti32x8 zmm0, zmm0, ymm1, 1

section .rodata
;; 36 bytes of data
align 32
vec_0to31: db 0..31    ; see below for a way to actually write this in NASM
mask_0x20: dd 0x20202020

The 16-byte-chunk way saves about 10 bytes vs. a plain load of the full constant: the size of a ZMM load with a RIP-relative addressing mode to get it into a register from .rodata. Or it saves only 4 bytes, the size of a RIP-relative addressing mode, if the alternative was using the constant straight from memory: vpaddb zmm0, zmm0, zmm31 vs. vpaddb zmm0, zmm0, [vector_const], depending on what you're doing with it.

$ objdump -drwC -Mintel foo
0000000000401000 <_start>:
  401000:       c4 e2 7d 58 0d 07 10 00 00      vpbroadcastd ymm1,DWORD PTR [rip+0x1007]        # 402010 <vec4_0x10>
  401009:       c5 f5 fe d1             vpaddd ymm2,ymm1,ymm1
  40100d:       c5 f9 6f d9             vmovdqa xmm3,xmm1
  401011:       c5 e5 fe e2             vpaddd ymm4,ymm3,ymm2
  401015:       62 f3 65 48 43 e4 11    vshufi32x4 zmm4,zmm3,zmm4,0x11
  40101c:       62 f2 fd 48 5a 05 da 0f 00 00   vbroadcasti64x2 zmm0,XMMWORD PTR [rip+0xfda]        # 402000 <vec16_0to15>
  401026:       62 f1 7d 48 fc c4       vpaddb zmm0,zmm0,zmm4

$ size foo
   text    data     bss     dec     hex filename
     64       0       0      64      40 foo

I did confirm this works with GDB attached to SDE:

# stopped before the last   vpaddb
(gdb) p /x $zmm0.v64_int8 
$2 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0,
  0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf}
(gdb) p /x $zmm4.v64_int8
$3 = {0x0 <repeats 16 times>, 0x10 <repeats 16 times>, 0x20 <repeats 16 times>, 0x30 <repeats 16 times>}

(gdb) si
0x000000000040102c in ?? ()
(gdb) p /x $zmm0.v64_int8 
$4 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d,
  0x1e, 0x1f, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39,
  0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f}

If you were considering doing something like this, use the version that starts with a 32-byte constant, since 3 instructions is not unreasonable and it's less total size. (That's assuming you don't lose space to padding from aligning the constant, or after the 4-byte constant. You could leave it unaligned, especially if you know it doesn't cross a cache-line boundary. Extra latency from a cache-line split might stop out-of-order exec from getting started on the work that uses the constant, so that's undesirable.)

@chtz suggests another alternative in comments:

You can create a {0,...,0,8,...,16,...24,...} vector using vpmovzxbq from {0,1,2,...,7} combined with a vpmultishiftqb with a broadcasted -3. Then add a broadcasted 0x0001020304050607 (can use the same memory as the vpmovzxbq).

I haven't tested this, but it could be interesting, especially if you want to use only immediates, no loads from .rodata. mov rax, 0x0706050403020100 / vpbroadcastq zmm0, rax / vpmovzxbq zmm1, xmm0 gives you the two constants based on that. With memory sources, you could use vporq or vpaddq with a [mem]{1to8} instead of a separate vpbroadcastq. Getting a -3 vector might just be mov rax, -3 / vpbroadcastq. Still 2 instructions, but one of them is scalar integer, not competing for vector execution units.
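Laying those steps out as an untested sketch (needs AVX-512VBMI for vpmultishiftqb; the register allocation is mine):

mov            rax, 0x0706050403020100
vpbroadcastq   zmm0, rax            ; every qword = bytes 0..7
vpmovzxbq      zmm1, xmm0           ; qwords = 0, 1, 2, ..., 7
mov            rax, -3
vpbroadcastq   zmm2, rax            ; control -3 = bit offset 61 mod 64 in every byte
vpmultishiftqb zmm1, zmm2, zmm1     ; bits 61..68 (wrapping) of qword j: each byte becomes j<<3
vpaddb         zmm0, zmm0, zmm1     ; add 0..7 within each qword: bytes 0..63

That would be 7 instructions with no .rodata, two of them scalar.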


Writing 0..n constants in asm source

I just don't know a way to write the constant to load in assembly that is both concise and clear

(val: .rept 64 / .byte .-val / .endr satisfies the former but not the latter, for instance.)

That's a neat use of GAS syntax (although of course ; is the statement separator if you want to actually put it all on one line). Seems like a comment on it would be sufficient.

In NASM syntax, %assign inside %rep 64 would be the natural way, as shown in the NASM manual's example of using %rep for unrolling a loop. In this case,

align 64
vec64_0to63:        ; self-explanatory name for the constant points readers in the right direction
  %assign i 0 
  %rep    64 
    db  i
    %assign i i+1 
  %endrep

Something equivalent is possible in GAS with .set.
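For instance, something like this should work (untested, same shape as the NASM version):

.align 64
vec64_0to63:
  .set i, 0
  .rept 64
    .byte i
    .set i, i+1
  .endr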

%xdefine would be usable, too, although that would make the assembler eval a growing 0+1+1+1+1+... text string every time.

Conversely, your idea in NASM syntax looks like this, where a comment and the label name remind readers how it works. I actually prefer this to the %assign version; there's less going on to keep track of.

vec64_0to63:
%rep 64
    db $-vec64_0to63       ; 0..63  value = offset from the label
%endrep

Doing it all on one line with times doesn't work: v2: times 16 db $-v2 fills with zeros, because $-v2 is evaluated to a constant zero before being repeated.

  • You can create a `{0,...,0,8,...,16,...24,...}` vector using `vpmovzxbq` from `{0,1,2,...,7}` combined with a `vpmultishiftqb` with a broadcasted `-3`. Then add a broadcasted `0x0001020304050607` (can use the same memory as the `vpmovzxbq`). – chtz Nov 27 '22 at 23:23
  • @chtz: sounds worthy of an answer to lay out the steps! You could bcast load and then register-source `vpmovzxbq`, since vpmovzx can't micro-fuse a memory source on Intel when the destination is YMM or wider. I guess it might save instructions and code-size to use `vpaddq y,y,[mem]{1to4}`? Oh, that only gets you one half of the full thing, so you'd need one more step of broadcasting that and adding `set1(0x20)`? – Peter Cordes Nov 28 '22 at 01:53
  • Just realized I never accepted this answer. – TLW May 07 '23 at 20:26