
I am trying to use Intel SIMD intrinsics to accelerate a query-answer program. Suppose query_cnt is input-dependent but is always smaller than the number of SIMD registers (i.e. there are enough SIMD registers to hold them all). Since the queries are the hot data in my application, instead of loading them each time they're needed, can I load them once up front and keep them in registers permanently?

Suppose the queries are float, and AVX (256-bit registers) is supported. Right now I have to use something like:

```cpp
#include <immintrin.h>
#include <vector>

// curr_query_ptr points at query_cnt floats (query_cnt assumed to be a
// multiple of 8 here)
std::vector<__m256> vec_queries(query_cnt / 8);
for (int i = 0; i < query_cnt / 8; ++i) {
    vec_queries[i] = _mm256_loadu_ps((const float *)curr_query_ptr);  // unaligned 8-float load
    curr_query_ptr += 8;
}
```

I know this isn't good practice, since there's potential load/store overhead. At best there's a slight chance the compiler optimizes vec_queries[i] so the values stay in registers, but I still don't think this is the right way to do it.

Any better ideas?

  • Are you processing multiple queries in a loop that doesn't do anything else? If not, your data won't still be in registers when you get to the next query. Or are you saying you think it might be worth it to use global register variables for queries? GNU C can do this: `__m256 vec_query0 asm("ymm0");` should be the right syntax IIRC (see the sketch after these comments). – Peter Cordes Sep 21 '16 at 07:15
  • Have you looked at the actual asm to see whether your vectors *aren't* being kept in registers? If you're lucky, the compiler might be optimizing away most of the std::vector dynamic-allocation overhead. If not, try using a fixed-size array (since you have a low upper bound on its size). – Peter Cordes Sep 21 '16 at 07:19
  • @PeterCordes Thanks for your advice. I did not look at the actual asm, but your suggestion to use a fixed-size array may be a good option, so that I can use something like `__m256 vec_query0 asm("ymm0")` to bind every array element to a register. But if I do so, will some of the registers always be occupied by the fixed elements, which might lead to a performance penalty? – MarZzz Sep 21 '16 at 07:29
  • Yes, they would be *permanently* pinned to those variables throughout your entire code base, with gcc assuming that nothing else ever touches them. You'd probably have to recompile all your libraries that way, too, since no ABI has any call-preserved ymm registers. (In the x64 Windows ABI, some xmm regs are preserved by function calls). [Global register variables](https://gcc.gnu.org/onlinedocs/gcc/Global-Register-Variables.html) are almost *never* a good idea for performance, even if it would work for this. – Peter Cordes Sep 21 '16 at 08:31
  • Also notice that I only mentioned scalars: you can't index a group of registers with a run-time variable. The register used by an asm instruction is encoded directly into the machine code, so there could be no indexing of your array, only fully-unrolled "loops". – Peter Cordes Sep 21 '16 at 08:36
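
To make the comment thread concrete, here is a minimal GNU C sketch of what pinning queries to global register variables would look like, if the compiler accepts vector registers for this at all. All names are invented for illustration, and as the comments explain, this is almost never worth doing:

```cpp
#include <immintrin.h>

// GNU C global register variables: gcc pins these to ymm0/ymm1 for the
// entire program, and assumes nothing else (including any library code not
// compiled with these declarations in scope) ever touches those registers.
register __m256 vec_query0 asm("ymm0");
register __m256 vec_query1 asm("ymm1");

void load_queries(const float *p) {
    vec_query0 = _mm256_loadu_ps(p);      // lives in ymm0 from now on
    vec_query1 = _mm256_loadu_ps(p + 8);  // lives in ymm1 from now on
}
// Note: there is no run-time indexing of registers, so any code using the
// queries must be fully unrolled, naming vec_query0 / vec_query1 directly.
```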

1 Answer


From the code sample you posted, it looks like you're just doing a variable-length memcpy. Depending on what the compiler does, and the surrounding code, you might get better results from just actually calling memcpy. e.g. for aligned copies with a size that's a multiple of 16B, the break-even point between a vector loop and `rep movsb` is maybe as low as ~128 bytes on Intel Haswell. Check Intel's optimization manual for some implementation notes on memcpy, and a graph of size vs. cycles for a couple of different strategies. (Links in the tag wiki.)
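
For instance, a minimal sketch of the question's loop collapsed into one memcpy call (keeping the original code's assumption that query_cnt is a multiple of 8):

```cpp
#include <cstring>   // std::memcpy
#include <immintrin.h>
#include <vector>

// One bulk copy instead of a manual load/store loop; the compiler and the
// library pick the copy strategy (vector loop, rep movsb, ...) based on size.
std::vector<__m256> vec_queries(query_cnt / 8);
std::memcpy(vec_queries.data(), curr_query_ptr, query_cnt * sizeof(float));
```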

You didn't say what CPU, so I'm just assuming recent Intel.

I think you're too worried about registers. Loads that hit in L1 cache are extremely cheap. Haswell (and Skylake) can do two `__m256` loads per clock (and a store in the same cycle). Before that, Sandybridge/IvyBridge can do two memory operations per clock, with at most one of them being a store; under ideal conditions (256b loads/stores), they can manage 2x 16B loaded and 1x 16B stored per clock. So loading/storing 256b vectors is more expensive than on Haswell, but still very cheap as long as they're aligned and hot in L1 cache.
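
Concretely, the pattern to aim for looks something like the following sketch (all names are hypothetical): just reload the query vectors inside the hot loop and let L1 do the work; the compiler is still free to keep them in registers across inner iterations when that's profitable:

```cpp
#include <immintrin.h>

// Hypothetical hot loop: query_cnt and cand_cnt are multiples of 8 here.
// Each _mm256_loadu_ps that hits L1 is very cheap, so reloading the queries
// for every candidate costs almost nothing.
__m256 score_all(const float *queries, int query_cnt,
                 const float *candidates, int cand_cnt) {
    __m256 acc = _mm256_setzero_ps();
    for (int c = 0; c < cand_cnt; c += 8) {
        __m256 cv = _mm256_loadu_ps(candidates + c);
        for (int q = 0; q < query_cnt; q += 8) {
            __m256 qv = _mm256_loadu_ps(queries + q);  // L1-hot load
            acc = _mm256_add_ps(acc, _mm256_mul_ps(qv, cv));
        }
    }
    return acc;  // reduction / comparison logic elided
}
```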

I mentioned in comments that GNU C global register variables might be a possibility, but mostly in a "this is technically possible in theory" sense. You probably don't want multiple vector registers dedicated to this purpose for the entire run-time of your program (including library function calls, so you'd have to recompile them).

In reality, just make sure the compiler can inline (or at least see while optimizing) the definitions of every function you use inside any important loops. That way it can avoid spilling/reloading vector regs across function calls (neither the Windows nor the System V x86-64 ABI has any call-preserved YMM (`__m256`) registers).
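
As a sketch of why that matters (helper name invented): if a per-vector helper is defined where the compiler can see it, the call disappears and the `__m256` values never leave registers; the same helper behind an out-of-line call in another translation unit would force spills/reloads around every call site.

```cpp
#include <immintrin.h>

// Defined in the same translation unit (or a header), so it inlines:
// no call, no spill/reload of ymm registers around it.
static inline __m256 mul_accum(__m256 a, __m256 b, __m256 acc) {
    return _mm256_add_ps(acc, _mm256_mul_ps(a, b));
}
```

Compare the generated asm with and without the definition visible to see the spill/reload difference for yourself.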

See Agner Fog's microarch pdf to learn even more about the microarchitectural details of modern CPUs, at least the details that are possible to measure by experiment and tune for.

  • I tried `memcpy` between `__m256 a` and `float b[8]` to imitate `_mm256_loadu_ps` and `_mm256_storeu_ps`, and found it really worked. At first I thought SIMD intrinsic data types like `__m256` or `__m256i` were special types that might have a better chance of being kept in registers. Now I've changed my mind: they seem to be nothing special compared with their counterparts like `float` or `int`, just with stricter alignment restrictions. Is this correct? If so, if I replace all the SIMD loads/stores with `memcpy`, what might be the potential performance difference? – MarZzz Sep 22 '16 at 01:59
  • @MarZzz: Yes, they work very much like scalar `float` or `int` types. If you're about to use a `__m256` (or `__m256i`) for more calculations, use the load intrinsics. (or use a simple assignment from another `__m256`). If you want to move a bunch of data around, use memcpy. You wouldn't use memcpy to assign `array[i]` to `float tmp`, so don't do it for SIMD types. It will *probably* optimize away, but it's less readable and certainly doesn't help the compiler. – Peter Cordes Sep 22 '16 at 02:08
  • If you're curious about stuff like this, look at the compiler output. It's a lot easier to sort of follow some code that the compiler generated than it is to write your own assembly from scratch, so you don't have to really know asm to try this. Put some code up on http://gcc.godbolt.org/ to get a [nicely formatted view](http://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output) of the asm (with optional colour highlighting to show which source line produced which asm line). – Peter Cordes Sep 22 '16 at 02:10