I need to implement a simple multiplication of nxn matrices, A*B=C, in C using SSE. The matrices are stored as one-dimensional float arrays. The problem is that _mm_load_ps() takes only a single pointer as its argument and loads four consecutive floats starting at that address. For matrix A this is fine because the values are located next to each other. For matrix B, however, I would need to give _mm_load_ps() four pointers, because the values I need are scattered across my one-dimensional representation. I also want to avoid copying the four values into a temporary array first. Is there a simple way (such as an SSE function) to do this? Thank you.
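
For illustration, the strided access described above can be sketched with `_mm_set_ps`, which builds a vector from four separate scalar values. This is only a minimal sketch: the helper name `load_column4` is hypothetical, and it assumes B is stored row-major with row length n. Compilers typically lower it to scalar loads plus shuffles or inserts, so it is not a true single-instruction gather.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: collect four elements of column `col`, taken from
   rows k..k+3 of a row-major n x n matrix B, into one SSE register.
   _mm_set_ps takes its arguments from the highest lane down to the lowest,
   so the element from row k ends up in lane 0. */
static __m128 load_column4(const float *B, int n, int k, int col)
{
    return _mm_set_ps(B[(k + 3) * n + col],
                      B[(k + 2) * n + col],
                      B[(k + 1) * n + col],
                      B[(k + 0) * n + col]);
}
```

No explicit temporary array appears in the source, but each element is still fetched with its own load, so the memory access pattern stays strided.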

Eli Duenisch
  • It sounds like you want vector gather instructions, which weren't added to x86 until AVX2. Any optimal SSE solution needs to use a mixture of shuffles and loads. This forum topic has some useful information: https://software.intel.com/en-us/forums/topic/285867 – BlamKiwi Nov 26 '14 at 21:23
  • Sorry - Shuffles, Loads, Inserts and Extracts. – BlamKiwi Nov 26 '14 at 21:39
  • Based on what you have described, you're probably going to run into cache locality issues. So SSE may not help much because you will be cache missing all over the place on matrix B. – Mysticial Nov 26 '14 at 21:47
  • Thank you for your answers. I solved it now by using a different algorithm which does not need to pick distributed values from the array. – Eli Duenisch Nov 27 '14 at 08:53
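
One common way to avoid picking scattered values from B, not necessarily the algorithm the poster ended up with, is to broadcast a single element of A and multiply it by four adjacent elements of a row of B. The sketch below assumes row-major storage and that n is a multiple of 4; the function name `matmul_sse` is hypothetical. It uses unaligned loads and stores so it does not depend on the alignment of the arrays.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sketch: C = A * B for row-major n x n float matrices.
   Assumption: n is a multiple of 4. */
static void matmul_sse(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; j += 4) {
            /* Accumulates C[i][j..j+3]. */
            __m128 acc = _mm_setzero_ps();
            for (int k = 0; k < n; ++k) {
                /* Broadcast A[i][k] into all four lanes. */
                __m128 a = _mm_set1_ps(A[i * n + k]);
                /* Four adjacent elements of row k of B: a contiguous load. */
                __m128 b = _mm_loadu_ps(&B[k * n + j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            _mm_storeu_ps(&C[i * n + j], acc);
        }
    }
}
```

Every vector load from B here reads four neighbouring floats, so no gather, shuffle, or temporary array is needed. For large n, the cache locality concern raised in the comments still applies, and blocking the loops would be the usual next step.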

0 Answers