With only SSE2, loading the full pattern from memory is generally your best bet.
In your NASM source you can use times 16 db 0x20 for easy maintainability.
With SSE3 you can do an 8-byte broadcast load with movddup. With AVX you can do a 4-byte broadcast load with vbroadcastss. These broadcast loads are very cheap on modern CPUs: they run on just the load port and don't need a shuffle uop, so they're exactly as cheap as movaps on CPUs that support them, except for a byte or two more code size. The same goes for vbroadcastf128 into YMM registers.
Most compilers don't seem to realize this and will do constant propagation through _mm_set1 even when that results in a 32-byte constant instead of 4 bytes, and even when the constant is just loaded with a mov... instruction ahead of a loop rather than folded into a memory operand for an ALU instruction. (Folding is still possible with a broadcast load when AVX512 is available.) Clang does sometimes take advantage of broadcast loads for simple constants.
AVX2 adds vpbroadcastb/w/d/q, but only dword and qword are pure load uops. Byte and word broadcast-loads need an ALU shuffle uop, so for constant byte patterns you probably want to just broadcast-load a dword that repeats a byte 4 times. (Unless it's an element from a big lookup table, then compress the table by using a byte or word broadcast load, or a pmovsx sign-extending load or whatever).
AVX512 adds vpbroadcastb/w/d/q from an integer register, so you could mov eax, 0x20202020 / vpbroadcastd xmm0, eax if you have AVX512VL.
With only SSE2, a broadcast takes at least 2 instructions, including an ALU shuffle, and may not be worth it:
movd xmm0, [const_4B]
pshufd xmm0, xmm0, 0
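For anyone who wants to check the movd / pshufd pair from C, here is a minimal sketch using SSE2 intrinsics. The names broadcast_dword_sse2 and all_bytes_equal are made up for this illustration, and a compiler is free to constant-fold the whole thing into a plain 16-byte load, so this only demonstrates what the two instructions compute, not what code you'll get.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: broadcast a 4-byte pattern to all four dword lanes
   using only SSE2, mirroring the movd + pshufd sequence above. */
__m128i broadcast_dword_sse2(uint32_t pattern)
{
    __m128i lo = _mm_cvtsi32_si128((int)pattern); /* movd: low dword = pattern, rest zero */
    return _mm_shuffle_epi32(lo, 0);              /* pshufd imm8=0: copy lane 0 to all lanes */
}

/* Hypothetical checker: returns 1 if every byte of v equals b. */
int all_bytes_equal(__m128i v, uint8_t b)
{
    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, v);
    for (int i = 0; i < 16; i++) {
        if (out[i] != b)
            return 0;
    }
    return 1;
}

/* Usage: the 0x20 byte pattern, stored as a dword that repeats the byte 4 times. */
void check_broadcast(void)
{
    assert(all_bytes_equal(broadcast_dword_sse2(0x20202020u), 0x20));
}
```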
Some repeating constants can be generated on the fly in a couple of instructions, starting with all-ones from pcmpeqd xmm0,xmm0. See "What are the best instruction sequences to generate vector constants on the fly?" and Agner Fog's guide.
This pattern does not appear to be easy to generate. It's a byte pattern (not word, dword, or qword), and SSE shifts are only available with word granularity at best. However, a wider shift is fine if we know the bits shifted across byte boundaries are 0, e.g.
pcmpeqd xmm0, xmm0 ; set1( -1 )
pabsb xmm0, xmm0 ; set1_epi8(1) SSSE3
pslld xmm0, 5 ; set1_epi8(1<<5)
; or with only SSE2, something even less efficient like shift / packsswb / shift
This is unlikely to be worth it unless you really want to avoid the possibility of a cache miss for the constant. On average, a load will come out ahead.
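As one possible reading of the SSE2-only "shift / packsswb / shift" idea, here is a sketch with intrinsics. The function names are invented for this example, and the exact sequence is my assumption about what such a fallback could look like, not necessarily the cheapest one. A compiler will probably constant-propagate through it and emit a load anyway, which fits the point above about compilers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical SSE2-only generator for 16 bytes of 0x20, without pabsb. */
__m128i spaces_on_the_fly_sse2(void)
{
    __m128i v = _mm_setzero_si128();  /* any defined input works for the compare */
    v = _mm_cmpeq_epi32(v, v);        /* pcmpeqd: all bits set */
    v = _mm_srli_epi16(v, 15);        /* psrlw 15: every word = 0x0001 */
    v = _mm_packs_epi16(v, v);        /* packsswb: every byte = 0x01 */
    return _mm_slli_epi32(v, 5);      /* pslld 5: every byte = 0x20, since no set
                                         bits cross a byte boundary */
}

/* Hypothetical checker: returns 1 if every byte of v equals b. */
int every_byte_is(__m128i v, uint8_t b)
{
    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, v);
    for (int i = 0; i < 16; i++) {
        if (out[i] != b)
            return 0;
    }
    return 1;
}

/* Usage: confirm the generated vector matches times 16 db 0x20. */
void check_spaces(void)
{
    assert(every_byte_is(spaces_on_the_fly_sse2(), 0x20));
}
```

That's 4 ALU instructions versus one load, which is why the load usually wins.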