Suppose I have a data table in the `.rodata` section. In my function, I want to use that table 3-4 times. I have two options:

Option 1 (smaller code size):

 mov    rax, MY_DATA_TABLE
 vpbroadcastb zmm2, BYTE [rax+64]
 vpbroadcastb zmm3, BYTE [rax+128]
 vpbroadcastb zmm4, BYTE [rax+192]

Option 2 (larger code size, but I think better performance, with lower latency):

 vpbroadcastb zmm2, BYTE [MY_DATA_TABLE+64]
 vpbroadcastb zmm3, BYTE [MY_DATA_TABLE+128]
 vpbroadcastb zmm4, BYTE [MY_DATA_TABLE+192]

Which one is better in general? Please note that I'm not asking only about `vpbroadcastb`; I mean instructions with a memory operand in general.

My data table holds 256 x 16-byte entries (one per byte value from 0 to 255, as packed bytes); it is not 256 x 64 bytes.

HelloMachine
  • Generally fewer uops is better, unless code-size gets extreme and becomes a bottleneck in your profiling. RIP-relative addressing modes can't micro-fuse in instructions with an immediate operand, but `vpbroadcastb` doesn't have one. If you are going to put an address into a register, though, see [How to load address of function or label into register](https://stackoverflow.com/q/57212012) - either `mov eax, foo` in a non-PIE executable for Linux, otherwise `lea rax, [rel foo]`. Latency isn't really the concern; OoO exec can run the instruction that puts an addr into a reg early. – Peter Cordes May 26 '22 at 18:52
  • `vpbroadcastb` to load every 64th byte looks really odd. A much more normal use-case would be `vmovdqa64` to load full vector constants that aren't the same in every byte (e.g. `vpermb` shuffle controls, or LUT constants for `vpermt2d` or whatever that you couldn't have compressed with `vpmovzxbd` or a broadcast-load, so you'd expect the load addresses to be separated by 64 bytes). – Peter Cordes May 26 '22 at 19:00
  • For 3 constant bytes, you might `mov eax, 0x112233` / `vpbroadcastb zmm2, eax`... Hmm, no, that would need another scalar instruction to set up a register for the vector op. Maybe `mov dword [rsp-4], 0x112233` into the red-zone and then vpbroadcastb reloads? Not ideal; each vpbroadcastb is a load+shuffle, although it does micro-fuse. Maybe still best to do 3 `vpbroadcastd` 32-bit broadcasts from `.rodata` (just load port uops) from addresses separated by 4. – Peter Cordes May 26 '22 at 19:04
  • If you did need 64-byte constants, you'd want `mov eax, table+128` so you could use `[rax-64]`, `[rax]`, and `[rax+64]`, disp8, disp0, disp8, since none of the offsets are outside the [-128, +127] range. `mov eax, table+128` is the same size as `mov eax, table`; it's a 32-bit address. (Same for RIP-relative, that's always the same size.) – Peter Cordes May 26 '22 at 19:06
  • @PeterCordes I want to do byte compares, so each vector needs to hold one specific byte value in every byte position; that's why I broadcast. `vmovdqa64` would waste space, since I'd have to define each byte 64 times. Are you saying that `vmovdqa64` is better than the broadcast approach? – HelloMachine May 26 '22 at 19:09
  • `[MY_DATA_TABLE+64]` and `[MY_DATA_TABLE+128]` already are 64 bytes apart, so unless some other code uses `[MY_DATA_TABLE+65..127]` for something, you already are wasting 64 bytes per byte constant. And it's a constant address, not indexing which byte to broadcast. If you only have 3 different bytes, I'd have expected `vpbroadcastb zmm2, [lut]` / `vpbroadcastb zmm3, [lut+1]` etc. – Peter Cordes May 26 '22 at 19:11
  • @PeterCordes My data table holds 256 x 16-byte entries (for SSE use, since SSE has no broadcast). I updated my code. I don't mean that my table is 256 x 64 bytes; I just use specific bytes at specific offsets. For example, offset 64 here means the value 4 (packed bytes, 16 bytes for each byte value from 0 to 255). – HelloMachine May 26 '22 at 19:16
  • SSE2 has `pshufd xmm2, xmm0, 0b00_00_00_00` / `pshufd xmm3, xmm0, 0b01_01_01_01` etc. after a single 16-byte load. Also, SSE3 has `movddup` 8-byte broadcast loads. Replicating each by 4 times (instead of 16) is a decent tradeoff between space and speed even if you have to care about SSE2, and even if you have to reload constants enough that it's worth the larger cache footprint. – Peter Cordes May 26 '22 at 19:33
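
The `pshufd` suggestion in the last comment can be sketched roughly as below. This is only a minimal illustration, assuming NASM syntax on x86-64 Linux; the label `byte_consts`, the three byte values, and the exit-status check at the end are hypothetical and do not come from the question.

```nasm
; Assumptions (not from the original post): Linux x86-64, NASM syntax,
; hypothetical label name and example byte values 0x11, 0x22, 0x33.
; Build: nasm -felf64 bcast.asm && ld -o bcast bcast.o

default rel                     ; make [label] RIP-relative

section .rodata
align 16
byte_consts:                    ; each constant byte replicated 4x (one dword)
    dd 0x11111111               ; dword 0
    dd 0x22222222               ; dword 1
    dd 0x33333333               ; dword 2
    dd 0                        ; padding up to 16 bytes

section .text
global _start
_start:
    movdqa  xmm0, [byte_consts]         ; a single 16-byte load
    pshufd  xmm2, xmm0, 0b00_00_00_00   ; every byte of xmm2 = 0x11
    pshufd  xmm3, xmm0, 0b01_01_01_01   ; every byte of xmm3 = 0x22
    pshufd  xmm4, xmm0, 0b10_10_10_10   ; every byte of xmm4 = 0x33

    ; Sanity check: the low dword of xmm4 should be 0x33333333.
    movd    eax, xmm4
    xor     edi, edi                    ; exit status defaults to 0
    cmp     eax, 0x33333333
    setne   dil                         ; status 1 if the broadcast is wrong
    mov     eax, 60                     ; SYS_exit
    syscall
```

Here four bytes per constant replace the sixteen bytes per constant of a fully replicated SSE table, at the cost of one shuffle per vector after the shared load.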

0 Answers