A common answer found on the Internet to the question "What is warp shuffling?" is:

"Warp shuffling is a technique to exchange values between two threads in a warp."

This definition is confusing at best, as it is hard to visualize.
What I understood from the texts available on the Internet is:
- Warp shuffling is a technique to load data from a kernel argument (an array, a vector, or a matrix) into a register variable, thereby eliminating the need for shared memory and being considerably faster than device memory.
Also, there are some API functions in CUDA related to warp shuffling that allow us to:
- copy the value of one lane's variable to the same variable in every other lane (called broadcasting)
- shift values between neighboring lanes, left to right or right to left (called shifting)
- exchange values between lanes in a crosswise pattern (called a butterfly exchange)
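The three patterns above correspond to the CUDA intrinsics `__shfl_sync`, `__shfl_up_sync`/`__shfl_down_sync`, and `__shfl_xor_sync`. A minimal sketch of all three on a single warp (lane count of 32 assumed):

```cuda
#include <cstdio>

// One warp of 32 threads; each lane starts with its own lane index in a
// register and exchanges it with other lanes via shuffle intrinsics.
__global__ void shuffleDemo() {
    int lane = threadIdx.x & 31;
    int val  = lane;

    // Broadcast: every lane reads lane 0's value.
    int b = __shfl_sync(0xffffffff, val, 0);

    // Shift: each lane reads the value from the lane one position below it
    // (lane 0 keeps its own value).
    int s = __shfl_up_sync(0xffffffff, val, 1);

    // Butterfly: lanes exchange with their XOR-1 partner (0<->1, 2<->3, ...).
    int x = __shfl_xor_sync(0xffffffff, val, 1);

    if (lane < 4)
        printf("lane %d: broadcast=%d shifted=%d butterfly=%d\n", lane, b, s, x);
}

int main() {
    shuffleDemo<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that no shared memory is touched: every exchange moves data directly between the registers of lanes in the same warp.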
However, I don't see the utility of warp shuffling in matrix multiplication beyond loading data into and out of register variables. How else could warp shuffling be useful in matrix multiplication?
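To make the one use I do see concrete, here is a hypothetical sketch (kernel name, the `N == 32` restriction, and the row-major layout of `B` are all my assumptions) where the broadcast pattern replaces a shared-memory staging step in a row-vector-times-matrix product:

```cuda
// Hypothetical kernel: one warp computes one row of C = row(A) * B.
// Each lane holds one element of the row of A in a register; __shfl_sync
// broadcasts lane k's element to all lanes each iteration, so no shared
// memory is needed to share the row. Assumes N == 32 for brevity.
__global__ void rowTimesMatrix(const float *A, const float *B, float *C, int N) {
    int lane  = threadIdx.x & 31;
    float a   = A[lane];     // one element of the row, kept in a register
    float acc = 0.0f;

    for (int k = 0; k < 32; ++k) {
        // Broadcast lane k's register value to every lane in the warp.
        float a_k = __shfl_sync(0xffffffff, a, k);
        acc += a_k * B[k * N + lane];   // each lane owns one column of B
    }
    C[lane] = acc;
}
```

Beyond this loading/broadcast use, I cannot see what shifting or butterfly exchanges would contribute to the multiplication itself.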