vpbroadcastq zmm26{k5}{z}, rax is an interesting hack; it could be useful if it runs efficiently, especially with merge-masking as an alternative to vmovq / vpinsrq.
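In intrinsics, the merge-masked version of that trick looks something like this (a sketch; the helper name is my own, compile with -mavx512f):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: insert a GP-register value into element i of a vector
// via merge-masked broadcast (vpbroadcastq zmm{k}, r64), instead of vmovq/vpinsrq.
static inline __m512i insert_epi64(__m512i v, int64_t x, unsigned i)
{
    __mmask8 k = (__mmask8)(1u << i);        // single-set-bit mask selecting element i
    return _mm512_mask_set1_epi64(v, k, x);  // broadcast x, but merge: only element i changes
}
```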
There is no single-instruction inverse to this (ab)use of vpbroadcastq, except for elements 0 or 1: vmovq rax, xmm26 or vpextrq rax, xmm26, 1. Yes, there are EVEX encodings for those instructions that let them access xmm16-31, in AVX512F and AVX512DQ respectively. If your data is in xmm0-15, you can use the shorter VEX-encoded version.
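In intrinsics, those element-0/1 extracts look like this (a sketch; the cast is a free reinterpret, not an instruction):

```c
#include <immintrin.h>
#include <stdint.h>

static inline int64_t extract0(__m512i v) {
    return _mm_cvtsi128_si64(_mm512_castsi512_si128(v));     // vmovq rax, xmm
}
static inline int64_t extract1(__m512i v) {
    return _mm_extract_epi64(_mm512_castsi512_si128(v), 1);  // vpextrq rax, xmm, 1
}
```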
However, you could abuse VPCOMPRESSQ zmm1/m512 {k5}{z}, zmm26 to do what you want with a memory or zmm destination, using the same single-set-bit mask register that you used for vpbroadcast. But it's not as fast as other options, so the only advantage is reusing the same mask register as a shuffle control, saving setup work if you can't hoist it out of a loop.
On KNL, VPCOMPRESSQ (with a register destination) has one per 3 cycle throughput (according to Agner Fog's testing). On Skylake-AVX512, it's one per 2 cycles, with 3c latency. Both of those CPUs run vpermq at 1 per cycle, so vpermq probably causes less interference with other instructions. I haven't found timings for the memory-destination version of vpcompressq.
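A sketch of that compress trick in intrinsics (helper name is mine; with a single-set-bit mask, compress moves element i down to element 0):

```c
#include <immintrin.h>
#include <stdint.h>

static inline int64_t extract_via_compress(__m512i v, unsigned i)
{
    __mmask8 k = (__mmask8)(1u << i);                      // same mask as the broadcast trick
    __m512i lo = _mm512_maskz_compress_epi64(k, v);        // vpcompressq zmm{k}{z}, zmm
    return _mm_cvtsi128_si64(_mm512_castsi512_si128(lo));  // vmovq rax, xmm
}
```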
Going the other direction without a store/reload requires at least one shuffle uop, plus a separate uop to copy from a vector to a GP register (like vmovq). (If you eventually want all the elements, a store/reload is probably better than a pure ALU strategy. But using ALU instructions for the first one or two elements is probably good, so dependent operations can get started with low latency.)
If your value is in the low 64b of a 128b "lane" (i.e. an even-numbered element), then vextracti64x2 xmm1, zmm26, 3 / vmovq rax, xmm1 is about as efficient as possible for a single element. The weird name is because the AVX512 version of vextracti128 comes in two flavours of masking granularity. If the element you want is in the 2nd 128b lane of zmm0-15, you can save code-size by using vextracti128 xmm1, ymm6, 1 (only a 3-byte VEX prefix for the AVX2 instruction, not 4-byte EVEX).
But if your value is in the upper 64b of a lane (i.e. an odd-numbered element, counting from 0), you'd need vpextrq rax, xmm, 1 instead of vmovq, and it decodes (on Skylake) to a shuffle uop and a vmovq uop. (Never use vpextrq rax, xmm, 0, because it wastes a shuffle uop. This is why compilers optimize _mm_extract_epi64(v, 0) to a vmovq.)
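Both cases in intrinsics (a sketch for elements 6 and 7, i.e. lane 3; needs AVX512DQ for vextracti64x2):

```c
#include <immintrin.h>
#include <stdint.h>

static inline int64_t extract6(__m512i v) {          // even element: low qword of lane 3
    __m128i lane = _mm512_extracti64x2_epi64(v, 3);  // vextracti64x2 xmm, zmm, 3
    return _mm_cvtsi128_si64(lane);                  // vmovq
}
static inline int64_t extract7(__m512i v) {          // odd element: high qword of lane 3
    __m128i lane = _mm512_extracti64x2_epi64(v, 3);
    return _mm_extract_epi64(lane, 1);               // vpextrq: shuffle uop + vmovq uop
}
```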
For an odd-numbered element, you can still do it in one shuffle with vpermq zmm1, zmm2, zmm3/m512/m64bcst + vmovq. If you need to extract in a loop, set up the shuffle-control vector constant outside the loop. Or if you need other constants anyway (so there's already a hot cache line of constants for your function), a broadcast-load memory operand should be fine when you're not in a loop.
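For example (a sketch; the compiler should hoist the constant or use a broadcast-load as appropriate):

```c
#include <immintrin.h>
#include <stdint.h>

static inline int64_t extract5(__m512i v) {
    __m512i idx  = _mm512_set1_epi64(5);                      // shuffle control; only element 0 matters for the result we read
    __m512i shuf = _mm512_permutexvar_epi64(idx, v);          // vpermq zmm, zmm, zmm/m64bcst
    return _mm_cvtsi128_si64(_mm512_castsi512_si128(shuf));   // vmovq
}
```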
vpermq + vmovq also works when the index is not a compile-time constant, since all you need in a shuffle-control vector is the index in element 0. e.g. vmovd xmm7, ecx sets you up for vpermq zmm1, zmm2, zmm7 / vmovq rax, xmm1.
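The runtime-variable version in intrinsics (a sketch; the upper elements of the control are don't-care because we only read result element 0):

```c
#include <immintrin.h>
#include <stdint.h>

static inline int64_t extract_var(__m512i v, int64_t i) {
    __m512i idx  = _mm512_castsi128_si512(_mm_cvtsi64_si128(i)); // vmovq xmm, r64: index into element 0
    __m512i shuf = _mm512_permutexvar_epi64(idx, v);             // vpermq zmm, zmm, zmm
    return _mm_cvtsi128_si64(_mm512_castsi512_si128(shuf));      // vmovq rax, xmm
}
```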
As @Bee says, store/reload is a good option if you need more than one element. You could also use it if you need a runtime-variable element, since store-forwarding from an aligned 512b store to an aligned 64b reload probably works without stalling. (Still higher latency than the vpermq solution, but it uses only memory uops, not ALU. ALU uops may be at a premium on Skylake-AVX512, where port 1 won't run any vector ALU uops while 512b uops are in flight.)
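A store/reload sketch (assumes an aligned scratch buffer; the 64b reload should store-forward from the 512b store):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdalign.h>

static inline int64_t extract_mem(__m512i v, unsigned i) {
    alignas(64) int64_t tmp[8];
    _mm512_store_si512(tmp, v);   // aligned 512b store (vmovdqa64)
    return tmp[i];                // aligned 64b reload, forwarded from the store
}
```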
If your element number is a compile-time constant, you could store only the required 128b lane of the ZMM vector to memory, using vextracti64x2 [rsp-16], zmm26, 3. (Or vextracti128 if it's lane 1.) If you want the value in memory eventually anyway, you could use a mask register with only the 2nd bit set to store just the high element. (But IDK how well that performs if the masked-out part of the store extends into an unmapped page. IIRC it doesn't actually fault, but microarchitecturally it may be slow to handle. Even crossing a cache-line boundary with the full 128b store could be slow.)
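Sketches of both store variants (the masked one is expressed as a masked full-width store, since a masked memory-destination vextracti64x2 has no direct intrinsic):

```c
#include <immintrin.h>
#include <stdint.h>

static inline void store_lane3(int64_t *dst, __m512i v) {   // writes elements 6..7 to dst[0..1]
    _mm_storeu_si128((__m128i*)dst, _mm512_extracti64x2_epi64(v, 3)); // vextracti64x2 [mem], zmm, 3
}
static inline void store_elem7(int64_t *dst8, __m512i v) {  // writes only dst8[7]
    _mm512_mask_storeu_epi64(dst8, (__mmask8)0x80, v);       // vmovdqu64 [mem]{k}, zmm
}
```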
The AVX2 VEXTRACTI128 [mem], ymm, 1 instruction runs on Skylake as just a (non-micro-fused) store, without needing a shuffle uop (http://agner.org/optimize/). AVX512 extract-to-memory is hopefully the same, still not using a shuffle uop. (Throughput / latency numbers from InstLatx64 are available, but we don't know what competes with what for which throughput resources, so they're a lot less useful than Agner Fog's instruction tables.)
For KNL, VEXTRACTF32X4 [mem], zmm is 4 uops, with bad throughput, and AVX2 vextracti128 [mem], ymm, imm8 is the same. So (assuming store-forwarding works well) just store the whole 512b vector on KNL.