All the string instructions that use EDI (or DI or RDI) use ES:EDI.
Explicit addressing modes using EDI (like [edi]) default to DS, but movs/stos/scas/cmps (with or without rep/repz/repnz) all use es:edi. A segment-override prefix only changes the ds:esi source operand; the es:edi side can't be overridden. lods only uses ds:esi. (rep lods "works", but is rarely useful. With cx=0 or 1 it can work as a slow conditional load, because unlike loop, rep checks cx before decrementing.)
Note that even though scas only reads memory, it uses es:(r|e)di, not ds:(r|e)si. This makes it work well with lods: load from one array with lods, then use scas to compare against a different array (optionally with some kind of processing of (r|e)ax before the compare).
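As a rough illustration of the lods + scas pairing, here is a minimal sketch using GNU C inline asm. It assumes GCC or Clang targeting x86-64 with the direction flag clear (as the usual ABIs guarantee); the function name `first_mismatch_lods_scas` is made up for this example. In 64-bit mode the ES base is treated as 0, so nothing needs to be set up for es:rdi.

```c
#include <stddef.h>

/* Hypothetical helper: index of the first byte where a[i] != b[i],
 * or n if the first n bytes are equal.
 * Assumes x86-64, GCC/Clang, and DF=0 on entry. */
static size_t first_mismatch_lods_scas(const unsigned char *a,
                                       const unsigned char *b, size_t n)
{
    size_t remaining = n;
    __asm__ volatile(
        "jrcxz 2f\n"        /* n == 0: nothing to compare */
        "1:\n\t"
        "lodsb\n\t"         /* al = [rsi]; rsi++   (ds:rsi side) */
        /* any processing of al before the compare would go here */
        "scasb\n\t"         /* cmp al, [rdi]; rdi++ (es:rdi side) */
        "jne 2f\n\t"        /* leave the loop at the first mismatch */
        "dec %%rcx\n\t"
        "jnz 1b\n"
        "2:\n"
        : "+S"(a), "+D"(b), "+c"(remaining)
        :
        : "rax", "cc", "memory");
    /* rcx was not decremented for the mismatching element */
    return n - remaining;
}
```

This is compact rather than fast: it handles one byte per iteration, so treat it as a code-size trick, not a performance technique.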
Normally when you can use 32-bit addresses, you have a flat memory model where all segments have the same base and limit. Or if you're making a .COM flat binary with NASM, you have the tiny real-mode memory model where all segments have the same value. See @MichaelPetch's comments on this answer and on the question. If your program doesn't work without setting ES, you're doing something weird. (like maybe clobbering es somewhere?)
Note that rep movsb in 16-bit mode without an address-size prefix uses CX, DS:SI, and ES:DI, regardless of whether you used operand-size prefixes to write edi instead of di.
Also note that rep string instructions (and especially the non-rep versions) are often not the fastest way to do things. They're good for code-size, but often slower than SSE/AVX loops.
rep stos and rep movs have fast microcoded implementations that store or copy in chunks of 16 or 32 bytes (or 64 bytes on Skylake-AVX512?). See Enhanced REP MOVSB for memcpy. With 32-byte aligned pointers and medium to large buffer sizes, they can be as fast as optimized AVX loops. With sizes below 128 or 256 bytes on modern CPUs, or with unaligned pointers, AVX copy loops typically win. Intel's optimization manual has a section on this.
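For reference, a rep movsb copy can be wrapped like this. It is a minimal sketch assuming GCC or Clang on x86-64 with DF=0; `copy_rep_movsb` is a hypothetical name, and whether it beats a library memcpy depends on the CPU, size, and alignment as described above.

```c
#include <stddef.h>

/* Hypothetical memcpy-like wrapper around rep movsb.
 * Assumes x86-64, GCC/Clang, DF=0, and non-overlapping buffers. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    /* rep movsb copies rcx bytes from [rsi] to [rdi] (es:rdi),
     * advancing both pointers and counting rcx down to 0. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

The "+D", "+S" and "+c" constraints tell the compiler that the instruction modifies RDI, RSI and RCX, and the "memory" clobber covers the bytes read and written.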
But repe cmpsb is definitely not the fastest way to implement memcmp: use SSE2 or AVX2 SIMD compares (pcmpeqb), because the microcode still only compares a byte at a time. (Beware of reading past the end of the buffer, and especially avoid crossing a page (or preferably cache line) boundary.) Anyway, repne / repe don't have "fast strings" optimizations in Intel or AMD CPUs, unfortunately.
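A SIMD compare along those lines might look like the following sketch, using SSE2 intrinsics (pcmpeqb is `_mm_cmpeq_epi8`, pmovmskb is `_mm_movemask_epi8`). It assumes GCC or Clang for `__builtin_ctz`; the name `first_diff_sse2` is invented for the example. To sidestep the over-read problem it only does full 16-byte loads that stay inside the buffers and finishes the tail with a scalar loop, which is simple rather than optimal.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: index of the first differing byte,
 * or n if the first n bytes of a and b are equal. */
static size_t first_diff_sse2(const unsigned char *a,
                              const unsigned char *b, size_t n)
{
    size_t i = 0;
    /* Full 16-byte chunks only, so no load reads past a+n or b+n. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* pcmpeqb sets each equal byte lane to 0xFF;
         * pmovmskb packs the lanes' top bits into a 16-bit mask. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        if (mask != 0xFFFFu)
            /* lowest clear bit of mask = first mismatching lane */
            return i + (size_t)__builtin_ctz(~mask);
    }
    /* Scalar tail for the last n % 16 bytes. */
    for (; i < n; i++)
        if (a[i] != b[i])
            return i;
    return n;
}
```

A memcmp-style result can then be derived by comparing `a[i]` and `b[i]` at the returned index when it is less than n.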