7

Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.

Thanks in advance.

starblue
alecco

4 Answers

7

Found it! It's in Section 4 of the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms" by Furtak, Amaral, and Niewiadomski.

It uses 4 SSE registers, has 12 steps, and runs in 19 instructions including load and store.

The same paper has some excellent work on dynamically making sorting networks with SIMD.
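To make the idea concrete, here is a minimal sketch in portable C. It is not the paper's exact sequence and it does not use SSE: it models four 16-byte registers as plain arrays and applies a 5-comparator network lane by lane, so every column ends up sorted. On SSE2 each `minmax_lanes` call would roughly correspond to a PMINUB/PMAXUB pair (`_mm_min_epu8` / `_mm_max_epu8`). The names `LANES`, `minmax_lanes`, and `sort_columns4` are my own, and the transpose step that would turn sorted columns into sorted rows is left out.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 };  /* bytes per modeled SIMD register (assumption: 128-bit) */

/* One comparator applied to every lane at once:
   lo[k] receives the smaller byte, hi[k] the larger one. */
static void minmax_lanes(uint8_t lo[LANES], uint8_t hi[LANES]) {
    for (size_t k = 0; k < LANES; k++) {
        uint8_t a = lo[k];
        uint8_t b = hi[k];
        lo[k] = (a < b) ? a : b;
        hi[k] = (a < b) ? b : a;
    }
}

/* Sort each of the 16 columns of a 4 x 16 byte block with the
   5-comparator network for 4 inputs. After the call,
   r[0][k] <= r[1][k] <= r[2][k] <= r[3][k] for every lane k. */
void sort_columns4(uint8_t r[4][LANES]) {
    minmax_lanes(r[0], r[1]);
    minmax_lanes(r[2], r[3]);
    minmax_lanes(r[0], r[2]);
    minmax_lanes(r[1], r[3]);
    minmax_lanes(r[1], r[2]);
}
```

The point of the sketch is that there are no branches on the data: the same five min/max steps run regardless of input, which is what makes this style map well onto SIMD.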

alecco
6

Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare-and-exchange instructions. (For N=16 that'll be more than "a few", though.)
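For N=4, a rough sketch of what that looks like in plain C, assuming the four bytes are packed in a `uint32_t` and the smallest should land in the least significant byte. The helper names `cmpxchg_bytes` and `sort4_bytes` are made up for illustration (the helper has nothing to do with the x86 CMPXCHG instruction), and a hand-written asm version would look quite different; this only shows the network itself.

```c
#include <stdint.h>

/* Compare-and-exchange on byte positions i < j of w (0 = least significant):
   afterwards byte i holds the smaller value and byte j the larger one. */
static uint32_t cmpxchg_bytes(uint32_t w, unsigned i, unsigned j) {
    uint32_t a = (w >> (8 * i)) & 0xFFu;
    uint32_t b = (w >> (8 * j)) & 0xFFu;
    if (a > b) {
        /* Clear both byte slots, then write the two values back swapped. */
        w &= ~((0xFFu << (8 * i)) | (0xFFu << (8 * j)));
        w |= (b << (8 * i)) | (a << (8 * j));
    }
    return w;
}

/* 5-comparator sorting network for 4 inputs:
   (0,1) (2,3) (0,2) (1,3) (1,2). */
uint32_t sort4_bytes(uint32_t w) {
    w = cmpxchg_bytes(w, 0, 1);
    w = cmpxchg_bytes(w, 2, 3);
    w = cmpxchg_bytes(w, 0, 2);
    w = cmpxchg_bytes(w, 1, 3);
    w = cmpxchg_bytes(w, 1, 2);
    return w;
}
```

For example, `sort4_bytes(0x03010402u)` returns `0x04030201u`.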

Darius Bacon
  • Thanks. I'm looking for an efficient asm solution. Oh, please note I said a "few instructions" and not a "few cycles" ;) – alecco Oct 16 '09 at 23:15
  • Ah, I see that the paper you linked to takes just this approach, using SSE2 instructions. Cool. – Darius Bacon Oct 17 '09 at 04:06
  • Yeah, I didn't want to be too verbose, as I was hoping for some sort of bit-hack magic in asm. In fact I was looking for this while reading "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture" (Chhugani et al., 2008), but got frustrated with the instructions for the algorithm: "1) a) Perform In-Register Sort to obtain sorted sequences of length K." I guess for researchers at Intel that's a "duh" procedure, but not for me! (I'm still not sure they do the whole 17-19 instruction procedure to sort a register.) [Note: sorry, I couldn't up-vote you because of lack of karma] – alecco Oct 17 '09 at 08:28
  • I learned something from a skim of the 2007 paper -- reward enough. :-) – Darius Bacon Oct 17 '09 at 19:23
  • By the way, there's a very efficient simultaneous 4-register sort in the original (2008) paper. It was right in front of me, my bad. – alecco Oct 19 '09 at 06:51
4

To speed up sorting of strings, I ended up packing 7 bytes per double and sorting (ranking) an array of 16 doubles in SSE2, using bitonic sort to create two runs of 8, and a binary merge to merge the two runs. You can see the first part here http://mischasan.wordpress.com/2011/07/29/okay-one-more-poke-at-sse2-sorting-doubles/ (asm) and here http://mischasan.wordpress.com/2011/09/02/update-on-bitonic-sse2-sort-of-16-doubles/ (C), and the bitonic merge step (if you want to go SSE all the way) here: http://mischasan.wordpress.com/2012/11/04/sse2-odd-even-merge-the-last-step-in-sorting/ . I replaced the insertion sort at the bottom of qsort with this sort, and it's about 5 times as fast as straight qsort. HTH

I hadn't seen the UofA paper; the bitonic logic is from old school (CTM) GPGPU programming.
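For readers who haven't met bitonic sort, a scalar sketch of the network in C follows. It is not the SSE2 code from the posts above; `bitonic_sort` is just an illustrative name, and it assumes `n` is a power of two. The comparator pattern depends only on `n`, never on the data, which is why it vectorizes so well.

```c
#include <stddef.h>

/* In-place bitonic sorting network, ascending order.
   Assumption: n is a power of two (e.g. 8 or 16). */
void bitonic_sort(double *a, size_t n) {
    for (size_t k = 2; k <= n; k <<= 1) {          /* size of runs being merged */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* comparator distance */
            for (size_t i = 0; i < n; i++) {
                size_t l = i ^ j;                  /* partner index */
                if (l > i) {
                    /* Blocks alternate direction so that adjacent runs
                       together form a bitonic sequence. */
                    int ascending = (i & k) == 0;
                    int out_of_order = ascending ? (a[i] > a[l])
                                                 : (a[i] < a[l]);
                    if (out_of_order) {
                        double t = a[i];
                        a[i] = a[l];
                        a[l] = t;
                    }
                }
            }
        }
    }
}
```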

Sorry about the embedded link strings; I don't know how to add clickable links on Stack Overflow.

Mischa
1

All sorting algorithms require "swapping" values from one place to another. Since you're talking about a literal CPU register, that means any sort would need another register to use as a temporary place to hold the bytes being swapped.

I've never seen a chip with a built-in method for sorting bytes within a register. Not saying it hasn't been done, but I can't think of many uses for such an instruction.

richardtallent
  • I meant sorting the bytes within a register; of course you have to use at least one other register. Sorry for the misunderstanding. – alecco Oct 16 '09 at 22:24
  • Actually there is a way to do in-register sorting with CMPXCHG on the eax register, rotating it between steps, as a friend who is quite knowledgeable in x86 showed me. There's little gain from it, but it is possible. Also, CMPXCHG is quite slow. – alecco Oct 21 '09 at 20:55
  • All SIMD architectures that I've used have such instructions. – alex strange Nov 04 '09 at 00:10