Does any memcpy() implementation use multiple processor registers?

Question

memcpy() is to my knowledge usually implemented as a loop:

// Pseudo code - for illustration only
while(len--)
  ++*dst=++*src;

Would it not make more sense to use all available CPU registers?! At least for large copies?!

// Pseudo code - for illustration only
register srcA,dstA
register srcB,dstB
register srcC,dstC

while(len-=numreg)
{
  *dstA=*srcA;
  *dstB=*srcB;
  *dstC=*srcC;
}

So the question is: do memcpy() implementations take the available registers into account specifically, or is this left to the compiler?

If you want to know, call memcpy() and have the compiler generate a .S file for you, or trace it in the debugger's disassembly view. I bet you will see all sorts of registers being used. Honestly, you pretty much never want to write code like yours. Use std::memcpy() and let the optimizer go to town, or assign one structure to another and let the optimizer handle that with registers or a memcpy() call. It judges these cases pretty well, better than almost any engineer could. – Michael Dorgan, Apr 03 '18 at 00:09
"// Pseudo code - for illustration only" => "// Unreadable code - for illustration only". This question doesn't make sense. Do you think that memcpy, a core function that EVERY PROGRAM IN THE WORLD uses, is not optimized in the standard library that the compiler uses? – Stargateur, Apr 03 '18 at 00:11
I could read it. Poorly written, but I got the idea behind it... – Michael Dorgan, Apr 03 '18 at 00:11
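To make the suggestion in the first comment concrete, here is a minimal sketch of the two forms it mentions: plain struct assignment and a fixed-size memcpy(). The type `pixel_block` and both function names are made up for illustration. Which registers or instructions the compiler picks depends on the target and optimization level, so nothing here should be read as a guarantee.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical example type; the name and layout are for illustration only. */
struct pixel_block {
    unsigned char data[64];
};

/* Copy via plain struct assignment; the compiler chooses the strategy. */
void copy_by_assignment(struct pixel_block *dst, const struct pixel_block *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
}

/* Copy via memcpy() with a size that is known at compile time. */
void copy_by_memcpy(struct pixel_block *dst, const struct pixel_block *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

Compiling this with something like `gcc -O2 -S file.c` and reading the generated assembly shows what the optimizer actually did for each function on your platform.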

sg7 · Accepted Answer · 2018-04-03T00:51:10.333

1

Would it not make more sense to use all available CPU registers?! At least for large copies?!

True.

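For large, 16-byte-aligned copies, a fast implementation can be hand-coded in assembler so that several registers are in flight at once. This one moves 128 bytes per iteration through eight SSE registers: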

    void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
    {
      // MSVC 32-bit inline assembly. Both pointers must be 16-byte aligned.
      // Only whole 128-byte blocks are copied; any tail (size % 128) is left to the caller.
      __asm
      {
        mov esi, src;    //src pointer
        mov edi, dest;   //dest pointer

        mov ebx, size;   //ebx is our counter
        shr ebx, 7;      //divide by 128 (8 * 128bit registers)
        jz loop_copy_end; //fewer than 128 bytes: nothing to copy here

        loop_copy:
          prefetchnta 128[ESI]; //prefetch the next block (non-temporal hint)
          prefetchnta 160[ESI];
          prefetchnta 192[ESI];
          prefetchnta 224[ESI];

          movdqa xmm0, 0[ESI]; //move data from src to registers
          movdqa xmm1, 16[ESI];
          movdqa xmm2, 32[ESI];
          movdqa xmm3, 48[ESI];
          movdqa xmm4, 64[ESI];
          movdqa xmm5, 80[ESI];
          movdqa xmm6, 96[ESI];
          movdqa xmm7, 112[ESI];

          movntdq 0[EDI], xmm0; //move data from registers to dest (non-temporal stores)
          movntdq 16[EDI], xmm1;
          movntdq 32[EDI], xmm2;
          movntdq 48[EDI], xmm3;
          movntdq 64[EDI], xmm4;
          movntdq 80[EDI], xmm5;
          movntdq 96[EDI], xmm6;
          movntdq 112[EDI], xmm7;

          add esi, 128;
          add edi, 128;
          dec ebx;

          jnz loop_copy; //loop please
        loop_copy_end:
          sfence;        //order the non-temporal stores before returning
      }
    }

Source: Very fast memcpy for image processing?

Blog: Improving memcpy for large memory copies

How to increase performance of memcpy

edited Apr 03 '18 at 00:51

answered Apr 03 '18 at 00:04

sg7


kind of curious how well this works if the `memcpy()` is for 4 bytes? – Richard Chambers Apr 03 '18 at 00:09
And the question was about the memcpy function. This one does not save the register contents on the stack. A standard function cannot be implemented this way. – 0___________ Apr 03 '18 at 00:12
Fastest for which processor models? You know that the dedicated processor instruction is periodically overhauled, and then does the most efficient thing? – Deduplicator Apr 03 '18 at 00:20
@RichardChambers Thank you for your comment. I doubt that it is fast for 4 bytes. But the OP asks about large copies. – sg7 Apr 03 '18 at 00:35
@Deduplicator You are very right! It does depend on the processor. This one is for Intel. The source claims 30-70% gain over `memcpy` in Microsoft Visual Studio 2005. – sg7 Apr 03 '18 at 00:40
@PeterJ_01 Thank you Peter. This one is specialized for copying big blocks. It requires 16-byte aligned memory and it copies in 128-byte blocks. – sg7 Apr 03 '18 at 00:46
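The constraints raised in these comments (16-byte alignment, 128-byte blocks, a poor fit for tiny copies) can be handled by a wrapper that falls back to the library memcpy() whenever they are not met. Below is a rough sketch using SSE2 intrinsics instead of inline assembly. The name `copy_large_aligned` is made up, the code assumes an x86 target with SSE2, and the 128-byte threshold is simply the block size, not a tuned value. It is not the routine from the linked source, and no speed-up over a modern memcpy() is implied.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: fast path for large aligned copies, memcpy() otherwise. */
void *copy_large_aligned(void *dest, const void *src, size_t size)
{
    assert(size == 0 || (dest != NULL && src != NULL));

    /* Small or misaligned copies: defer to the library memcpy(). */
    if (size < 128 || (uintptr_t)dest % 16 != 0 || (uintptr_t)src % 16 != 0)
        return memcpy(dest, src, size);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dest;

    for (size_t blocks = size / 128; blocks > 0; blocks--) {
        /* Load 128 bytes into eight 128-bit values. */
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        __m128i r4 = _mm_load_si128(s + 4);
        __m128i r5 = _mm_load_si128(s + 5);
        __m128i r6 = _mm_load_si128(s + 6);
        __m128i r7 = _mm_load_si128(s + 7);

        /* Write them out with non-temporal (cache-bypassing) stores. */
        _mm_stream_si128(d + 0, r0);
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        _mm_stream_si128(d + 4, r4);
        _mm_stream_si128(d + 5, r5);
        _mm_stream_si128(d + 6, r6);
        _mm_stream_si128(d + 7, r7);

        s += 8;  /* 8 * 16 bytes = 128 bytes */
        d += 8;
    }
    _mm_sfence();  /* order the non-temporal stores before continuing */

    /* Copy the remaining 0..127 bytes. */
    memcpy(d, s, size % 128);
    return dest;
}
```

With this shape, a 4-byte copy never reaches the SSE loop at all, and the caller does not need to know about the alignment requirement.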

score 0 · Answer 2 · answered Apr 03 '18 at 00:07

First of all, your pseudo code is wrong, as you forgot to increment the pointers. Once you account for that, your optimisation stops making any sense.

Another problem is that it can't copy an arbitrary number of bytes, which is a must for any standard function.

You can of course write a highly optimised function for fast memory moves using specific processor features, but it will be barely usable as a replacement for the standard memcpy function.
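For reference, a portable version of the idea in the question, with the two problems above addressed, might look roughly like this. The name `copy_unrolled` and the choice of three 8-byte words per iteration are assumptions made for illustration; this is not how any particular C library implements memcpy(), and an optimizing compiler may rewrite the loop anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: unrolled copy that handles any length. */
void *copy_unrolled(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len == 0 || (d != NULL && s != NULL));

    /* Main loop: three 8-byte words per iteration, held in three locals
       that the compiler is free to keep in separate registers. */
    while (len >= 3 * sizeof(uint64_t)) {
        uint64_t a, b, c;
        memcpy(&a, s, sizeof a);        /* fixed-size memcpy: alignment-safe load */
        memcpy(&b, s + 8, sizeof b);
        memcpy(&c, s + 16, sizeof c);
        memcpy(d, &a, sizeof a);        /* alignment-safe stores */
        memcpy(d + 8, &b, sizeof b);
        memcpy(d + 16, &c, sizeof c);
        s += 24;                        /* advance both pointers each iteration */
        d += 24;
        len -= 24;
    }

    /* Tail: the remaining 0..23 bytes, one at a time. */
    while (len--)
        *d++ = *s++;

    return dst;
}
```

The fixed-size memcpy() calls inside the loop are the usual way to express an unaligned word load or store in portable C; compilers typically turn each one into a single move instruction.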
