
If I know I have, e.g., at least 4 doubles sitting at a given (aligned) location in memory, double *d, I can simply do __m256d x = _mm256_load_pd(&d[i]), i.e. load them into an AVX(2) register.

The question is: How do I correctly handle cases where there aren't 4 doubles left at the given location, i.e. I'd theoretically access the array out of bounds?

One solution that I have been using so far is to allocate memory only in multiples of 4 * 8 bytes in this specific case. Alternatively, for cases where I do not completely control the memory allocation, I have also been playing with something like the following, assuming that this is probably not the way to go:

#include <immintrin.h>

static inline __m256d load_256d(size_t diff, size_t i, const double *d)
{
    if (diff >= 4) {
        return _mm256_load_pd(&d[i]);
    }
    if (diff == 3) {
        return _mm256_set_pd(0.0, d[i+2], d[i+1], d[i]);
    }
    if (diff == 2) {
        return _mm256_set_pd(0.0, 0.0, d[i+1], d[i]);
    }
    return _mm256_set_pd(0.0, 0.0, 0.0, d[i]);
}

What is the canonical solution (or at least a recommended one) for a case like this?

s-m-e
    Related: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) and [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) (yes). Padding your allocations to a multiple of 4 doubles is often good, especially if you also align by 32 (so use `aligned_alloc` instead of `malloc`). – Peter Cordes Apr 19 '23 at 09:06
  • You're correct that branching like this is not great, especially with those implementations if you compiler doesn't simplify the 1 and 2 element cases to just `vmovsd` or `vmovups xmm`. But you normally want to keep that size calc out of the inner loop and do a cleanup at the end. Possibly with a partial vector, or a final unaligned vector that ends at the last element of your array, if its total size is >= 4 and your code can safely re-process elements. (e.g. copy-and-operate to a separate destination.) – Peter Cordes Apr 19 '23 at 09:11
  • the CPU doesn't care if you access out of bounds as long as the out-of-bounds is within the same page (which, being in the same 16-byte group, it is) – user253751 Apr 19 '23 at 09:48
    AVX vectors are 32 bytes, not 16. Out of interest, who decides what is the page size? Is it the CPU, the c standard, or something else? I would argue it's bad practice to rely on the CPUs ability to process memory that is out of bounds. Could it be allocated to another variable? – Simon Goater Apr 19 '23 at 10:35
  • it is the CPU which decides the page size but it is also the CPU which decides how big your vectors are and which instructions are available. Once you are using something like AVX, portability seems to go out the window – user253751 Apr 19 '23 at 11:23
    @SimonGoater: In the case of x86, it's determined by the ISA. The same ISA that makes AVX available, which this code uses via intrinsics. On some other ISAs like AArch64, the OS may have a choice of page sizes (so software might just check for 4k boundaries to be conservative, as that's the minimum), or in embedded systems might operate with paging disabled. (That's literally impossible for x86-64, but possible in 32-bit mode.) – Peter Cordes Apr 19 '23 at 11:47

1 Answer


For reads, and assuming the start of the overall vector is aligned, one simply reads the entire SIMD block and ignores the undesired elements. The hardware design is such that if the first byte of a block is readable, all bytes of the block are readable (because the pages used to map and protect memory are aligned to boundaries at least as large as the SIMD block alignments).

For writes, there is no canonical answer; there are multiple options depending on circumstances, including:

  • Require the calling software to provide padding so that whole-block stores can always be performed even if only one element is used in the last block.
  • Use an instruction that stores with a mask to specify which elements are updated.
  • Write code separate from the main loop to handle the last block using instructions for scalar elements or partial blocks (e.g., 16-byte SIMD instructions instead of 32-byte instructions).
  • Use an unaligned store to store a whole block that ends where the destination vector ends. This will overlap elements in the prior block, so they can either be stored twice (if the computation permits) or merged (load, permute as necessary, store). (This also requires taking care to handle the case where the entire vector is less than a full block.)
  • If the application is single-threaded (so it is known no other code that could be writing to the same block is executing), read the last block, merge in the changed elements, and write the last block.
Eric Postpischil