If you really want to know about register values, rather than __m128i C variable values, I'd suggest using a debugger like GDB. print /x $xmm0.v2_int64 when stopped at a breakpoint.
Capturing a register at the top of a function is a pretty flaky and unreliable thing to try to attempt (smells like you've already gone down the wrong design path)1. But you're on the right track with a register-asm local var. However, xmm0 can't match an "=r" constraint, only "=x". See Reading a register value into a C variable for more about using an empty asm template to tell the compiler you want a C variable to be what was in a register.
You do need the asm volatile("" : "=x"(var)); statement, though; GNU C register-asm local vars have no guarantees whatsoever except when used as operands to asm statements. (GCC will often keep your var in that register anyway, but IIRC clang won't.)
There's not a lot of guarantee about where this will be ordered wrt. other code (asm volatile may help some, or for stronger ordering also use a "memory" clobber). Also no guarantee that GCC won't use the register for something else first. (Especially a call-clobbered register like any xmm reg.) But it does at least happen to work in the version I tested.
print a __m128i variable shows how to print a __m128i as two 64-bit halves once you have it, or as other element sizes. The compiler will often optimize _mm_store_si128 / reload into shuffles, and this is for printing anyway so keep it simple.
Using a unsigned __int128 tmp; would also be an option in GNU C on x86-64.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif
// If you need this, you're probably doing something wrong.
// There's no guarantee about what a compiler will have in XMM0 at any point
void foo() {
register __m128i xmm0 __asm__("xmm0");
__asm__ volatile ("" :"=x"(xmm0));
alignas(16) uint64_t buf[2];
_mm_store_si128((__m128i*)buf, xmm0);
printf("%llu %llu\n", buf[1], buf[0]); // I'd normally use hex, like %#llx
}
This prints the high half first (most significant), so reading left to right across both elements we get each byte in descending order of memory address within buf.
It compiles to the asm we want with both GCC and clang (Godbolt), not stepping on xmm0 before reading it.
# GCC10.2 -O3
foo:
movhlps xmm1, xmm0
movq rdx, xmm0 # low half -> RDX
mov edi, OFFSET FLAT:.LC0
xor eax, eax
movq rsi, xmm1 # high half -> RSI
jmp printf
Footnote 1:
If you make sure your function doesn't inline, you could take advantage of the calling convention to get the incoming values of xmm0..7 (for x86-64 System V), or xmm0..3 if you have no integer args (Windows x64).
__attribute__((noinline))
void foo(__m128i xmm0, __m128i xmm1, __m128i xmm2, etc.) {
// do whatever you want with the xmm0..7 args
}
If you want to provide a different prototype for the function for callers to use (which omits the __m128i args), that can maybe work. It's of course Undefined Behaviour in ISO C, but if you truly stop inlining, the effects depend on the calling convention. As long as you make sure it's noinline so link-time optimization doesn't do cross-file inlining.
Of course, the mere fact of inserting a function call will change register allocation in the caller, so this only helps for a function you were going to call anyway.