
I am trying to optimize the following sum: `sum{ vec4[indexarray[i]] * scalar[i] }`, where each element of `vec4` is a `float[4]` and each element of `scalar` is a `float`. With 128-bit registers, this boils down to

sum = _mm_fmadd_ps(
            _mm_loadu_ps(vec4[indexarray[i]]),
            _mm_set_ps1(scalar[i]),
            sum);
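
For reference, the surrounding loop looks roughly like this (`n` and `result` are placeholder names, not from my real code):

// needs <immintrin.h> and FMA support
__m128 sum = _mm_setzero_ps();
for (int i = 0; i < n; ++i) {
    sum = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i]]),  // load one float[4] row
                       _mm_set_ps1(scalar[i]),             // broadcast the scalar
                       sum);
}
_mm_storeu_ps(result, sum);                                // result is a float[4]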

If I want to execute the FMA on 256 bit registers, I would have to do something like

__m256 coef = _mm256_set_m128(
                    _mm_set_ps1(scalar[2 * i + 0]),
                    _mm_set_ps1(scalar[2 * i + 1]));
__m256 vec = _mm256_set_m128(
                    _mm_loadu_ps(vec4[indexarray[2 * i + 0]]),
                    _mm_loadu_ps(vec4[indexarray[2 * i + 1]]));
sum = _mm256_fmadd_ps(vec, coef, sum);

along with a shuffle and add at the end to sum the upper and lower lanes.
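
That final step would be something along these lines (a sketch; `total` is a name introduced just for this snippet):

__m128 total = _mm_add_ps(_mm256_castps256_ps128(sum),     // low lane (free, no instruction)
                          _mm256_extractf128_ps(sum, 1));  // high lane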

Theoretically, I gain 5 cycles of latency (assuming the Haswell architecture) from the single FMA, but lose 2x3 cycles of latency from the two `_mm256_set_m128` operations.

Is there a way to make this any faster using the ymm registers, or will all the gains from the single FMA be offset, with interest, by the cost of combining the xmm registers?

  • Intel's MKL library has a function for exactly that operation: `cblas_?doti`, see https://software.intel.com/en-us/mkl-developer-reference-c-cblas-doti. I would suspect that this function is highly optimized by Intel. So you could either use it directly or look at its (assembly) source -- provided you have an Intel compiler license. – Daniel Junglas May 15 '19 at 07:37
  • Doesn't `sdoti` need the `x` and `y` arrays to have the same number of elements? In my case, the number of elements in `x` is a quarter of that in `y`. – Avi Ginsburg May 15 '19 at 07:52
  • No, the function I pointed to is the *sparse* version. You can see this in the "Description" section of the page. One vector is dense and the other vector is sparse. I think the function performs *exactly* your operation. – Daniel Junglas May 15 '19 at 10:16
  • How large is `vec4`, or what is the range of `indexarray`? It might be more efficient to first sum up `scalar[i]` into a temporary array (each value at `temp[indexarray[i]]`), and compute a simple matrix-vector product at the end -- unless the entries in `indexarray` are very sparse. (A sketch of this idea follows these comments.) – chtz May 15 '19 at 15:30
  • @chtz `vec4` is in the thousands (less than 10k). `indexarray` is sparse (I don't have an exact number to give you, but my intuition is about 5-10% full), but the indices tend to come clumped together. – Avi Ginsburg May 16 '19 at 09:35
  • @DanielJunglas I think what Avi actually wants is a dense `4 x n` matrix times sparse vector product. Followup questions to Avi: are there duplicates in `indexarray`? Are the entries sorted? If values are clumped together, is it an option to store them in blocks of 2, 4, or 8? (e.g., have one start-index `i` combined with values `s[i]` to `s[i+7]`) This would allow hand-tuned fixed-size matrix-vector products in the inner loop. – chtz May 21 '19 at 08:53
  • @chtz Close. It's a dense 4xN matrix times a sparse matrix, one row at a time (effectively a sparse vector, as you stated). The `indexarray` does not contain duplicates and the indices are monotonically increasing. I'm not sure which "them" you want to store in blocks. The scalars are stored sequentially, the rows of the 4xN matrix are skipped based on the indices. – Avi Ginsburg May 21 '19 at 10:14
  • @AviGinsburg I meant storing the entries of your sparse vector/matrix in blocks – chtz May 21 '19 at 11:26
  • @chtz As in convert the matrix to BSR format? Possibly, I'm not sure how much additional refactoring that will involve. – Avi Ginsburg May 21 '19 at 11:44
  • Yes that would need to be done during construction of the sparse matrix. But this diverges from your original question ... – chtz May 21 '19 at 16:04
  • @chtz yes, very much not my original question, although a very valid and relevant point. – Avi Ginsburg May 21 '19 at 16:08
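
A rough sketch of the temporary-array idea from the comments above (`N`, `nnz`, and `temp` are hypothetical names; it only pays off if a dense pass over all `N` columns is acceptable):

float temp[N] = {0};                   // one slot per vec4 row; N is in the thousands here
for (int i = 0; i < nnz; ++i)
    temp[indexarray[i]] += scalar[i];  // pure scatter, since indexarray has no duplicates

__m128 sum = _mm_setzero_ps();
for (int j = 0; j < N; ++j)            // dense 4xN matrix-vector product
    sum = _mm_fmadd_ps(_mm_loadu_ps(vec4[j]), _mm_set_ps1(temp[j]), sum);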

1 Answer


> but lose 2x3 in latency from the `_mm256_set_m128`

No, that latency isn't on the critical path; it's part of preparing an input for FMA. The concern with doing more shuffles for each independent i value is throughput.

Latency only really matters for the loop-carried dependency chain through `sum`, which is both an input to and an output of the FMA.

The inputs that only depend on i and memory contents can be handled in parallel across multiple iterations by out-of-order execution.


You might still come out ahead, though, building 256-bit vectors. However you write the source (`_mm256_set_m128` isn't a real instruction), it will probably bottleneck on the front-end or on 1-per-clock shuffle throughput. You want it to compile to a 128-bit load and then `vinsertf128 ymm, ymm, [mem], 1` to insert the high half of a vector. `vinsertf128` does cost a shuffle uop.
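
One way to write that with intrinsics (a sketch matching the operand order in the question; the compiler should fold the second load into a `vinsertf128` memory operand):

__m256 vec = _mm256_castps128_ps256(                            // low half: plain vmovups
                 _mm_loadu_ps(vec4[indexarray[2 * i + 1]]));
vec = _mm256_insertf128_ps(vec,                                 // high half: vinsertf128 ymm, ymm, [mem], 1
                 _mm_loadu_ps(vec4[indexarray[2 * i + 0]]), 1);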

If you bottleneck on latency with 128-bit registers, you might be better off just using multiple accumulators, so (on Haswell) up to 10 FMAs can be in flight at once: 5c latency x 0.5c throughput. Add them at the end. See also: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?
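
A minimal sketch of that version (assumes `n` is a multiple of 4; a cleanup loop for the leftovers is omitted):

__m128 sum0 = _mm_setzero_ps(), sum1 = _mm_setzero_ps();
__m128 sum2 = _mm_setzero_ps(), sum3 = _mm_setzero_ps();
for (int i = 0; i < n; i += 4) {       // 4 independent dep chains hide the 5c FMA latency
    sum0 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 0]]), _mm_set_ps1(scalar[i + 0]), sum0);
    sum1 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 1]]), _mm_set_ps1(scalar[i + 1]), sum1);
    sum2 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 2]]), _mm_set_ps1(scalar[i + 2]), sum2);
    sum3 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 3]]), _mm_set_ps1(scalar[i + 3]), sum3);
}
__m128 sum = _mm_add_ps(_mm_add_ps(sum0, sum1), _mm_add_ps(sum2, sum3));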

  • Doesn't [`_mm256_set_m128`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_set_m128&expand=4912) compile to `vinsertf128`? – Avi Ginsburg May 15 '19 at 08:12
  • @AviGinsburg: It will in this case, with a `vmovups` first. If both inputs happened to be contiguous in memory, it could compile to a single 256-bit load. Or if both inputs are the same, it will (hopefully) compile to a single 128-bit broadcast-load. A compiler could also choose to compile it as `vperm2f128` if both inputs were in registers, but usually `vinsertf128` would be best for that case, too. – Peter Cordes May 15 '19 at 08:13
  • Thanks. I decided to use your Y solution to my X problem (opting for the multiple accumulators). Not that that's mutually exclusive with the issue I asked about, but I'm more likely to hit the inevitable cache-miss issues that way. – Avi Ginsburg May 15 '19 at 09:08
  • @AviGinsburg: yeah, you could even do both, like a mix of 2x `__m128` and 1x `__m256` or something and unroll by 4x `vec4`. But 128-bit vectors can compile *very* efficiently, with a `vbroadcastss` for the scalar load, and a memory source operand for the FMA. (Or a 128-bit load from `scalar[i]` that you shuffle 4 ways, to reduce pressure on the load ports in case you don't bottleneck on cache misses, or possibly increasing memory parallelism by helping have gather load addresses ready sooner; see the sketch after these comments.) – Peter Cordes May 15 '19 at 09:46
  • I opted for 8 accumulators. If I end up getting a boost of 4x or more, I'll combine the `__m128`s into `__m256`. Although, I also like your single-load-with-four-shuffles idea, so I might add that first. – Avi Ginsburg May 15 '19 at 09:51
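
For reference, a rough sketch of the single-load, four-shuffle idea from the comment above, combined with the four accumulators from the answer (loop structure and names are assumptions):

for (int i = 0; i < n; i += 4) {
    __m128 s = _mm_loadu_ps(&scalar[i]);                        // one load grabs 4 scalars
    __m128 c0 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(0, 0, 0, 0));  // broadcast element 0
    __m128 c1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 c2 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 c3 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(3, 3, 3, 3));
    sum0 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 0]]), c0, sum0);
    sum1 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 1]]), c1, sum1);
    sum2 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 2]]), c2, sum2);
    sum3 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 3]]), c3, sum3);
}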