Use pxor with an all-ones register.
pandn is also usable, but has no advantages: there is nothing pandn with an all-ones constant lets you do that you couldn't have done with pxor.
psubd is also usable (2's complement identity: ~x = -1 - x, i.e. all-ones minus x), but is even worse than pandn because it has lower throughput on some CPUs (it runs on fewer execution ports).
pcmpeqd xmm1, xmm1 ; create the all-ones constant. No false dependency.
pxor xmm0, xmm1 ; flip all the bits in XMM0. Doesn't destroy XMM1
;pandn xmm0, xmm1 ; equivalent but no advantage. (~xmm0) & xmm1
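For anyone working with intrinsics rather than asm, here is a minimal C sketch of the same three options, assuming an x86 target with SSE2. The function names (`not_xor`, `not_andn`, `not_sub`, `same_bits`) are made up for illustration; which instructions the compiler actually emits depends on the compiler and flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* NOT via XOR with all-ones (the pxor approach). */
static __m128i not_xor(__m128i v) {
    return _mm_xor_si128(v, _mm_set1_epi32(-1));
}

/* NOT via ANDN: _mm_andnot_si128(a, b) computes (~a) & b. */
static __m128i not_andn(__m128i v) {
    return _mm_andnot_si128(v, _mm_set1_epi32(-1));
}

/* NOT via subtraction: -1 - v == ~v in each 32-bit lane. */
static __m128i not_sub(__m128i v) {
    return _mm_sub_epi32(_mm_set1_epi32(-1), v);
}

/* Hypothetical helper: returns 1 if all 128 bits of a and b match. */
static int same_bits(__m128i a, __m128i b) {
    return memcmp(&a, &b, sizeof a) == 0;
}
```

All three produce the same bits; the difference is only in what the resulting instruction can do with its operands, which is the point of the rest of this answer.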
PXOR is nice because it's commutative. With AVX, you can load-and-NOT with one micro-fused uop:
vpxor xmm0, xmm1, [rdi]
You can't do that with VPANDN, because the operand that can be memory or register is the one that isn't inverted. (Without AVX, though, just load with movdqa or movdqu and then pxor the load result. The alternative, a reg-copy of the constant plus a micro-fused load+pxor, is 3 total unfused-domain uops vs. 2.)
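A rough intrinsics equivalent of load-and-NOT, again assuming x86 with SSE2. `not16` is a hypothetical helper name. Written this way, a compiler targeting AVX is free to fold the load into the memory operand of vpxor, but whether it does so depends on the compiler and options.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: invert 16 bytes from src into dst.
   Unaligned loads/stores are used, so neither pointer needs alignment. */
static void not16(uint8_t *dst, const uint8_t *src) {
    __m128i ones = _mm_set1_epi32(-1);                    /* all-ones constant */
    __m128i v = _mm_loadu_si128((const __m128i *)src);    /* load the data */
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(v, ones)); /* flip and store */
}
```

Because XOR is commutative, it doesn't matter to the compiler which of `v` and `ones` ends up as the memory operand.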
Or without AVX, if you want to destroy the all-ones constant instead of the data you're inverting, pxor wins again:
movdqa xmm2, xmm1 ; copy the all-ones constant. Off the critical path for latency
pxor xmm2, xmm0
This takes the movdqa off the critical path, vs. movdqa xmm2, xmm0 / pandn xmm2, xmm1 where the copy of the data is part of the dependency chain. (Only IvyBridge+ and Bulldozer-family/Ryzen have zero-latency movdqa for vector regs.) Or if you were rematerializing the all-ones every time with pcmpeqd in the target register (maybe because of register pressure or because you're not doing it in a loop), that would be another case where you want pxor instead of pandn.
Generating an all-ones constant with pcmpeqb/w/d is special-cased to not have a false dependency on the old value (except on Silvermont where it does), but does still need an execution unit (unlike xor-zeroing on Sandybridge-family). Still, it's cheap, and it's what compilers use for _mm_set1_epi32(-1).
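As a small illustration in C (assuming SSE2; the function names are hypothetical), comparing a register against itself gives all-ones, which is the same value `_mm_set1_epi32(-1)` asks for:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>

/* Compare a register with itself: every lane compares equal,
   so every bit of the result is set. */
static __m128i all_ones_cmpeq(void) {
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* What you'd normally write and let the compiler pick the instruction. */
static __m128i all_ones_set1(void) {
    return _mm_set1_epi32(-1);
}
```

In real code you'd just write `_mm_set1_epi32(-1)`; the explicit compare is only here to show the idiom.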
Re-creating the constant every time you need it instead of copying from another register is slightly worse on IvyBridge and later, and on Bulldozer-family and Ryzen. mov-elimination for XMM copies avoids occupying a vector execution unit / port, in case vector-ALU execution ports were your bottleneck.
But it's slightly better on Intel P6-family (Core2/Nehalem), where register-read stalls can be a problem when reading too many "cold" registers in an issue group. (See Agner Fog's microarch PDF, https://agner.org/optimize/.) P6-family is obsolete but still in use in some old machines. You might want to tune for it in the non-AVX version of your code if you also have an AVX version that runs on CPUs with AVX. (But Haswell/Skylake Pentium / Celeron are still a thing, and they don't have AVX, so no-AVX doesn't mean old CPU.)