A common answer found on the Internet to the question "What is warp shuffling?" is:

"Warp shuffling is a technique to exchange values between two threads in a warp."

This definition is confusing at best, as it is hard to visualize.
What I understood from the texts available on the Internet is:
- Warp shuffling is a technique to load data from a kernel argument (an array, a vector, or a matrix) into a register variable, thereby eliminating the need for shared memory and being considerably faster than device memory.
Also, there are some API functions in CUDA related to warp shuffling that allow us to:
- copy the value of one lane's variable to the same variable in every other lane (called broadcasting)
- shift values between neighboring lanes, left to right or right to left (called shifting)
- exchange values between lanes in a crosswise pattern (called a butterfly exchange)
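The three patterns above correspond to the CUDA intrinsics `__shfl_sync`, `__shfl_up_sync`/`__shfl_down_sync`, and `__shfl_xor_sync`. A minimal sketch of all three on a single warp (lane count of 32 assumed):

```cuda
#include <cstdio>

// One warp of 32 threads; each lane starts with its own lane index in a
// register and exchanges it with other lanes via shuffle intrinsics.
__global__ void shuffleDemo() {
    int lane = threadIdx.x & 31;
    int val  = lane;

    // Broadcast: every lane reads lane 0's value.
    int b = __shfl_sync(0xffffffff, val, 0);

    // Shift: each lane reads the value from the lane one position below it
    // (lane 0 keeps its own value).
    int s = __shfl_up_sync(0xffffffff, val, 1);

    // Butterfly: lanes exchange with their XOR-1 partner (0<->1, 2<->3, ...).
    int x = __shfl_xor_sync(0xffffffff, val, 1);

    if (lane < 4)
        printf("lane %d: broadcast=%d shifted=%d butterfly=%d\n", lane, b, s, x);
}

int main() {
    shuffleDemo<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that no shared memory is touched: every exchange moves data directly between the registers of lanes in the same warp.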
However, I don't see the utility of warp shuffling in matrix multiplication beyond loading data into and out of register variables. How else could warp shuffling be useful in matrix multiplication?
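To make the one use I do see concrete, here is a hypothetical sketch (kernel name, the `N == 32` restriction, and the row-major layout of `B` are all my assumptions) where the broadcast pattern replaces a shared-memory staging step in a row-vector-times-matrix product:

```cuda
// Hypothetical kernel: one warp computes one row of C = row(A) * B.
// Each lane holds one element of the row of A in a register; __shfl_sync
// broadcasts lane k's element to all lanes each iteration, so no shared
// memory is needed to share the row. Assumes N == 32 for brevity.
__global__ void rowTimesMatrix(const float *A, const float *B, float *C, int N) {
    int lane  = threadIdx.x & 31;
    float a   = A[lane];     // one element of the row, kept in a register
    float acc = 0.0f;

    for (int k = 0; k < 32; ++k) {
        // Broadcast lane k's register value to every lane in the warp.
        float a_k = __shfl_sync(0xffffffff, a, k);
        acc += a_k * B[k * N + lane];   // each lane owns one column of B
    }
    C[lane] = acc;
}
```

Beyond this loading/broadcast use, I cannot see what shifting or butterfly exchanges would contribute to the multiplication itself.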