1

How do I get the bitwise negation of the values in an XMM register? As far as I know there is no such instruction. The only instruction with negation is pandn, but to use it to simply negate the values in one XMM register, I would need another XMM register filled with all ones.

Is there another way to negate the bits in an XMM register? Or is there a clever way to fill an XMM register with all ones without accessing memory?

phuclv
Rames

2 Answers

6

To load a register with all 1s, use

pcmpeqd xmm0, xmm0

After that you can simply subtract xmmX from xmm0 (e.g. with psubd) to get ~xmmX, because -1 - x == ~x in two's complement, or use pandn.
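For anyone working with intrinsics rather than asm, here is a minimal sketch of the same idea in C with SSE2. The helper names (all_ones, not_xor, not_sub, not_andn, vec_equal) are made up for illustration and are not from any library. It only demonstrates that the three approaches give the same result; which instructions the compiler actually emits is up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All-ones vector. Compilers typically materialize this with pcmpeqd reg,reg. */
static inline __m128i all_ones(void) {
    return _mm_set1_epi32(-1);
}

/* ~x via pxor: x ^ all-ones */
static inline __m128i not_xor(__m128i x) {
    return _mm_xor_si128(x, all_ones());
}

/* ~x via psubd: (-1) - x == ~x in each 32-bit lane (two's complement) */
static inline __m128i not_sub(__m128i x) {
    return _mm_sub_epi32(all_ones(), x);
}

/* ~x via pandn: (~x) & all-ones */
static inline __m128i not_andn(__m128i x) {
    return _mm_andnot_si128(x, all_ones());
}

/* Hypothetical helper for checking: 1 if all 128 bits of a and b match */
static inline int vec_equal(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

All three return the same vector for any input, so the choice between them is purely about performance and operand constraints.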

You can also generate other constants in xmm registers easily, without loading from memory:

pcmpeqd xmm0, xmm0
psrld   xmm0, 30   ; 3 (32-bit)

pcmpeqd xmm0, xmm0 ; -1

pcmpeqw xmm0, xmm0 ; 1.5f
pslld   xmm0, 24
psrld   xmm0, 2

pcmpeqw xmm0, xmm0 ; -2.0f
pslld   xmm0, 30
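To see why those shift counts give those values, here is a small C sketch with SSE2 intrinsics that reproduces the bit patterns. The function names are hypothetical, and a compiler is free to constant-fold these into a load from memory, so this is only meant to illustrate the arithmetic, not to guarantee the instruction sequence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* 0xFFFFFFFF >> 30 = 0x00000003 in every 32-bit lane */
static inline __m128i const_int3(void) {
    __m128i ones = _mm_set1_epi32(-1);
    return _mm_srli_epi32(ones, 30);
}

/* 0xFFFFFFFF << 24 = 0xFF000000, then >> 2 = 0x3FC00000,
   which is the IEEE-754 single-precision bit pattern of 1.5f */
static inline __m128i const_1_5f_bits(void) {
    __m128i ones = _mm_set1_epi32(-1);
    __m128i t = _mm_slli_epi32(ones, 24);
    return _mm_srli_epi32(t, 2);
}

/* 0xFFFFFFFF << 30 = 0xC0000000, the bit pattern of -2.0f */
static inline __m128i const_neg2f_bits(void) {
    return _mm_slli_epi32(_mm_set1_epi32(-1), 30);
}

/* Reinterpret lane 0 as a float (no conversion, just a bit cast) */
static inline float lane0_as_float(__m128i v) {
    return _mm_cvtss_f32(_mm_castsi128_ps(v));
}
```

0x3FC00000 has exponent field 127 and only the top mantissa bit set, hence 1.5; 0xC0000000 has the sign bit set, exponent field 128 and a zero mantissa, hence -2.0.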

Read Agner Fog's optimization guide, 13.4 Generating constants - Making constants for integer vectors in XMM registers

phuclv
  • ...or `pxor` with the all-ones register. – EOF Jan 19 '16 at 16:46
  • `psubd` and `pandn` have zero advantage over `pxor`. Worse throughput for `psubd`, and neither are commutative. – Peter Cordes Sep 24 '18 at 03:30
  • [Constant floats with SIMD](https://stackoverflow.com/q/6565556/995714), [What are the best instruction sequences to generate vector constants on the fly?](https://stackoverflow.com/q/35085059/995714) – phuclv Aug 26 '19 at 04:40
5

Use pxor with an all-ones register.

pandn is also usable, but has zero advantages. There are no cases where pandn with an all-ones constant lets you do anything you couldn't have done with pxor.

psubd is also usable (2's complement identity), but is even worse than pandn because it has lower throughput on some CPUs (fewer execution ports).


pcmpeqd  xmm1, xmm1      ; create the all-ones.  No false dependency.

pxor     xmm0, xmm1      ; flip all the bits in XMM0. Doesn't destroy XMM1
;pandn    xmm0, xmm1      ; equivalent but no advantage.  (~xmm0) & xmm1

PXOR is nice because it's commutative. With AVX, you can load-and-NOT with one micro-fused uop:

vpxor    xmm0, xmm1, [rdi]

You can't do that with VPANDN, because the operand that can be memory or register is the not-inverted operand. (Without AVX, though, just do a movdqa or movdqu load and then pxor the load result. A reg-copy and a micro-fused load+pxor is 3 total unfused-domain uops vs. 2.)
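With intrinsics, the load-and-NOT pattern might look like the sketch below. The names load_not and not_buffer are hypothetical. When compiling with AVX enabled, a compiler may fold the load into the memory operand of vpxor, but that is an optimization decision and not something the source code guarantees.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Load 16 bytes from p (no alignment required) and invert every bit. */
static inline __m128i load_not(const void *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_xor_si128(v, _mm_set1_epi32(-1));
}

/* Invert n_vecs 16-byte blocks from src into dst. */
static void not_buffer(uint8_t *dst, const uint8_t *src, size_t n_vecs) {
    for (size_t i = 0; i < n_vecs; i++) {
        _mm_storeu_si128((__m128i *)(dst + 16 * i), load_not(src + 16 * i));
    }
}
```

Because xor is commutative, it does not matter which operand the compiler picks as the memory source; with _mm_andnot_si128 only the non-inverted operand could come from memory.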


Or without AVX, if you want to destroy the all-ones constant instead of the data you're inverting, pxor wins again:

movdqa  xmm2, xmm1      ; copy the all-ones constant.  Off the critical path for latency
pxor    xmm2, xmm0

This takes the movdqa off the critical path, compared to movdqa xmm2, xmm0 / pandn xmm2, xmm1. (Only IvyBridge+ and Bulldozer-family/Ryzen have zero-latency movdqa for vector regs.) Or if you were rematerializing the all-ones every time with pcmpeqd in the target register (maybe because of register pressure or because you're not doing it in a loop), that would be another case where you want pxor instead of pandn.


Generating an all-ones constant with pcmpeqb/w/d is special-cased to not have a false dependency on the old value (except on Silvermont where it does), but does still need an execution unit (unlike xor-zeroing on Sandybridge-family). Still, it's cheap, and it's what compilers use for _mm_set1_epi32(-1).

Re-creating the constant every time you need it instead of copying from another register is slightly worse on IvyBridge and later, and on Bulldozer-family and Ryzen. mov-elimination for XMM copies avoids occupying a vector execution unit / port, in case vector-ALU execution ports were your bottleneck.

But it's slightly better on Intel P6-family (Core2/Nehalem): register-read stalls can be a problem when reading too many "cold" registers in an issue group. (See Agner Fog's microarch pdf https://agner.org/optimize/). P6-family is obsolete but still in use in some old machines. You might want to tune for it in the non-AVX version of your code if you have an AVX version which runs on CPUs with AVX. (But Haswell/Skylake "pentium" / "celeron" are still a thing, and they don't have AVX, so no-AVX doesn't mean old CPU.)

Peter Cordes
  • See also [Is NOT missing from SSE, AVX?](https://stackoverflow.com/a/42616203) for intrinsics, and AVX-512F `vpternlogd` which gets the job done with no vector constant. – Peter Cordes Jan 08 '22 at 09:22