Assume I have a C++ std::vector of doubles that should be loaded into an AVX2 register. This can be done simply with the _mm256_load_pd(&vector1[0]) intrinsic.
The vector can have any size; it need not be a multiple of 4. What would be the most efficient way to load the remaining elements when the vector size is not a multiple of 4?
Viewed 722 times
1
vydesaster
-
What do you intend to do with the register? Does it matter at what position the remaining elements are stored inside the register? – chtz Mar 02 '20 at 01:12
-
@chtz: I have 4 or 5 different vectors where I would like to perform addition and multiplications element by element. All vectors have the same length. It does not matter at what position the remaining elements are stored inside the register. – vydesaster Mar 02 '20 at 07:19
3 Answers
3
Pad your array so that its length is a multiple of four. This wastes a little memory but avoids the inefficiency of if statements and branching.
FShrike
-
Unfortunately I cannot do this to the original vector. Would it also be a good solution to copy the non-matching vector into a new matching vector? Would the copy cost too much time? "Matching" here refers to size. – vydesaster Mar 01 '20 at 20:00
-
The vectors are part of a matrix. Extending each vector would mean extending the matrix, which would require a lot of extra code and branching in the other functions that work on the matrix. – vydesaster Mar 01 '20 at 20:06
-
Possibly you could insert temporary null values when you need to load the vectors into AVX registers and then remove them when you're done. But since that happens inside whatever efficient loop you need, it may slow things down to the point of uselessness. The only way to know is to time it yourself. – FShrike Mar 01 '20 at 20:08
-
@vydesaster: The normal way to handle this is to have a row "stride" (distance between 2 rows) that's separate from the actual logical row "width" (number of columns that actually matter). So you do indexing calculations with the row stride, but column loop bounds with the width. It's very common in computer graphics to do this, where it also allows you to pass a cropped rectangle to another function without copying. (Row stride is still the same, start and width are a subset of the full row.) – Peter Cordes Mar 01 '20 at 22:02
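The stride/width split described in this comment could look roughly like the following. This is a minimal sketch, not code from the thread: `PaddedMatrix`, `row`, `at`, and `add_row` are hypothetical names, and the inner four-element loop merely stands in for one 4-wide vector operation so that the sketch runs on any machine. Here the stride is assumed to be the width rounded up to a multiple of 4, so looping over the stride never leaves a remainder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix whose rows are padded to a multiple of 4 doubles.
struct PaddedMatrix {
    std::size_t rows;
    std::size_t width;         // logical number of columns
    std::size_t stride;        // allocated doubles per row (width rounded up to 4)
    std::vector<double> data;  // rows * stride elements, padding initialized to 0

    PaddedMatrix(std::size_t r, std::size_t w)
        : rows(r), width(w), stride((w + 3) / 4 * 4), data(r * stride, 0.0) {}

    // Indexing uses the stride, not the width.
    double* row(std::size_t r) {
        assert(r < rows);
        return data.data() + r * stride;
    }
    const double* row(std::size_t r) const {
        assert(r < rows);
        return data.data() + r * stride;
    }
    double& at(std::size_t r, std::size_t c) {
        assert(c < width);
        return row(r)[c];
    }
};

// Adds row r of a and b into row r of dst, four columns per step.
// The bound is the stride, so there is no tail to handle; the padding
// columns are computed as well, but callers only read columns < width.
void add_row(const PaddedMatrix& a, const PaddedMatrix& b,
             PaddedMatrix& dst, std::size_t r) {
    assert(a.stride == b.stride && a.stride == dst.stride);
    const double* pa = a.row(r);
    const double* pb = b.row(r);
    double* pd = dst.row(r);
    for (std::size_t c = 0; c < a.stride; c += 4) {
        for (std::size_t k = 0; k < 4; ++k) {  // stands in for one 4-wide vector op
            pd[c + k] = pa[c + k] + pb[c + k];
        }
    }
}
```

Code that only cares about the logical contents keeps using `width` as its column bound, so the padding stays invisible to it.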
2
You can use the _mm256_maskload_pd intrinsic. Its second parameter is a mask that indicates which elements to load.
1201ProgramAlarm
-
@1201ProgramAlarm: You have to expand that bitmask to a vector element mask; it's non-trivial to do that efficiently. A load with a sliding window into a cache line is one way, otherwise ALU computation. Or since the OP has AVX512, use AVX512 zero-masking so you can just use a *bit* mask. `__m256d _mm256_maskz_load_pd( __mmask8 k, void * m);` The `__mmask8` type is just `uint8_t` so it converts freely from integer types. – Peter Cordes Mar 01 '20 at 22:00
-
@PeterCordes: Thanks for the feedback. Do you also see a solution for pure AVX2 code without AVX512? – vydesaster Mar 01 '20 at 22:08
-
@vydesaster: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) or [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) – Peter Cordes Mar 01 '20 at 22:21
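Without AVX-512, one way to get the vector element mask discussed in these comments is the sliding-window load: keep a small table of all-ones followed by zeros and load four 64-bit lanes from an offset that depends on the remainder. The following is a minimal sketch of that idea, not code from the thread. `tail_mask` and `mul_add` are hypothetical names, and the multiply-add is only a placeholder for the real element-wise work. The intrinsic path is compiled only when the compiler defines `__AVX__` (for example with `-mavx2`); otherwise a scalar loop keeps the sketch runnable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>

// Returns a mask whose first `count` 64-bit lanes have all bits set (count in 0..4).
// The load slides a 4-lane window over a table of ones followed by zeros.
static inline __m256i tail_mask(std::size_t count) {
    static const std::int64_t window[8] = {-1, -1, -1, -1, 0, 0, 0, 0};
    assert(count <= 4);
    return _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(window + 4 - count));
}
#endif

// out[i] = a[i] * b[i] + c[i] for i in [0, n); placeholder element-wise operation.
void mul_add(const double* a, const double* b, const double* c,
             double* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Full groups of four elements.
    for (; i + 4 <= n; i += 4) {
        const __m256d va = _mm256_loadu_pd(a + i);
        const __m256d vb = _mm256_loadu_pd(b + i);
        const __m256d vc = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(out + i, _mm256_add_pd(_mm256_mul_pd(va, vb), vc));
    }
    // Remaining 1..3 elements: masked-out lanes are neither read nor written.
    if (i < n) {
        const __m256i m = tail_mask(n - i);
        const __m256d va = _mm256_maskload_pd(a + i, m);
        const __m256d vb = _mm256_maskload_pd(b + i, m);
        const __m256d vc = _mm256_maskload_pd(c + i, m);
        _mm256_maskstore_pd(out + i, m,
                            _mm256_add_pd(_mm256_mul_pd(va, vb), vc));
    }
#else
    // Scalar fallback when AVX is not enabled at compile time.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
#endif
}
```

Whether the masked tail is faster than a short scalar loop depends on the CPU, so it is worth timing both.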
2
If you want to load the elements to do element-wise operations (and store them back to the same or to another vector afterwards) an easy solution is to use overlapping loads/stores.
Simplified example (needs special handling if vect.size()<4):
    // load last four elements for later use
    __m256d last_input = _mm256_loadu_pd(vect.data()+vect.size()-4);
    for(size_t i=0; i<vect.size()-4; i+=4) { // main loop
        __m256d input = _mm256_loadu_pd(vect.data()+i);
        // the store intrinsic takes the destination pointer first, then the value
        _mm256_storeu_pd(output.data()+i, some_operation(input));
    }
    // process and store last elements (possibly overlapping with previous store):
    _mm256_storeu_pd(output.data()+vect.size()-4, some_operation(last_input));
Make sure to compile with optimizations, and on gcc/clang with -march=native (otherwise the unaligned loads/stores may inefficiently get split).
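For reference, here is a self-contained variant of the overlapping approach that also covers vectors with fewer than four elements. It is a sketch under stated assumptions rather than the answer's own code: `square_plus_one` is a hypothetical stand-in for `some_operation`, and the intrinsic path is only compiled when `__AVX__` is defined (for example with `-mavx2` or `-march=native`), with a scalar loop used otherwise and for short inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// out[i] = in[i] * in[i] + 1.0; hypothetical stand-in for some_operation.
// `in` and `out` may refer to the same vector.
void square_plus_one(const std::vector<double>& in, std::vector<double>& out) {
    assert(out.size() == in.size());
    const std::size_t n = in.size();
    std::size_t i = 0;
#if defined(__AVX__)
    if (n >= 4) {
        const __m256d one = _mm256_set1_pd(1.0);
        // Load the last four inputs before any store, so the overlapping
        // elements are still unmodified when in and out are the same vector.
        const __m256d last = _mm256_loadu_pd(in.data() + n - 4);
        // i + 4 < n is the same bound as i < n - 4, written without subtraction.
        for (; i + 4 < n; i += 4) {
            const __m256d v = _mm256_loadu_pd(in.data() + i);
            _mm256_storeu_pd(out.data() + i,
                             _mm256_add_pd(_mm256_mul_pd(v, v), one));
        }
        // Final store may overlap the previous one; it rewrites the same values.
        _mm256_storeu_pd(out.data() + n - 4,
                         _mm256_add_pd(_mm256_mul_pd(last, last), one));
        return;
    }
#endif
    // Fewer than four elements, or AVX not enabled at compile time.
    for (; i < n; ++i) {
        out[i] = in[i] * in[i] + 1.0;
    }
}
```

Loading the last block up front is what keeps the in-place case correct; if input and output never alias, that load could equally be done after the loop.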
chtz
-
Assume the vector size is 6, wouldn't your solution operate twice on the vector elements 2 and 3 when index starts at 0? – vydesaster Mar 02 '20 at 10:52
-
Yes, but with the same input. I.e., it does some redundant calculations, but saves a lot of complicated logic. If you are on a CPU where AVX-128 is faster than AVX-256, you may consider investing in that extra logic, though (same for AVX-512 vs AVX-256, of course). – chtz Mar 02 '20 at 12:39
-
Ah, now I see the point. I will investigate your solution. Thanks so far. – vydesaster Mar 02 '20 at 12:53