This limitation comes from the 8086, so we have to go all the way back there to try to guess at the rationale.
One possible hint comes from the encoding. As you probably know, most 8086 ALU instructions consist of a one-byte opcode followed by a Mod-Reg-R/M byte which specifies the operands, one register and one R/M that can be register or memory. The register operand is specified in the 3-bit Reg field of this byte, and the R/M operand by the remaining 5 bits.
In the case of shifts, we actually have seven different instructions sharing one opcode. Opcode D2 is used for all the 8-bit variable shift and rotate instructions:
rol r/m8, cl
ror r/m8, cl
rcl r/m8, cl
rcr r/m8, cl
shl r/m8, cl
shr r/m8, cl
sar r/m8, cl
Since the cl operand for the shift count is hardcoded, the Reg field of the Mod-Reg-R/M byte is not needed for a register operand, and can be used instead to distinguish between these instructions. This is notated in instruction listings like
D2 /2 rcl r/m8, cl
in other words, rcl r/m8, cl uses opcode D2 and the value 2 in the 3-bit Reg field.
(shl r/m8, cl actually has two encodings, D2 /4 and D2 /6. Bit 1 of the Reg field probably selects between logical and arithmetic shift, because shr and sar are D2 /5 and D2 /7 respectively. So in some sense D2 /4 is shl and D2 /6 is sal, but a logical left shift and an arithmetic left shift are the same operation.)
The 16-bit variable shift and rotates are the same, but using opcode D3.
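To make the role of the Reg field concrete, here is a minimal Python sketch of a decoder for just these two opcodes. The names `GROUP2`, `split_modrm` and `decode_variable_shift` are made up for illustration, and memory operands are only shown as a placeholder rather than decoded.

```python
# Mnemonics selected by the 3-bit Reg field (/0 .. /7) for opcodes D2 and D3.
# /6 is shown as "sal", following the two-encoding note above.
GROUP2 = ["rol", "ror", "rcl", "rcr", "shl", "shr", "sal", "sar"]
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def split_modrm(modrm):
    """Return the (mod, reg, rm) fields of a Mod-Reg-R/M byte."""
    return modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111


def decode_variable_shift(opcode, modrm):
    """Decode 'D2 /r' or 'D3 /r' into assembly text (hypothetical helper)."""
    if opcode not in (0xD2, 0xD3):
        raise ValueError("not a variable shift/rotate opcode")
    mod, reg, rm = split_modrm(modrm)
    if mod == 0b11:
        # Register-direct: the R/M field names the register being shifted.
        operand = (REG8 if opcode == 0xD2 else REG16)[rm]
    else:
        operand = "[mem]"  # memory forms are not decoded in this sketch
    # The Reg field picks the operation; the count register is always cl.
    return f"{GROUP2[reg]} {operand}, cl"


print(decode_variable_shift(0xD2, 0xD0))  # rcl al, cl
print(decode_variable_shift(0xD3, 0xE3))  # shl bx, cl
```

Note that nothing in the two bytes names cl: all eight values of the Reg field are spent on choosing the operation, so there is no field left over for a count register.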
So by hardcoding the shift count register, the instruction set designers got to encode 7 instructions while using up only one opcode. Had they allowed an arbitrary register to be used for the count, the Reg field would have been needed to name that register, so they'd have needed 7 different opcodes for the 8-bit shift and rotate instructions, and 7 more for the 16-bit versions. The opcode map is pretty crowded. I suppose 60-6F were available at the time, but they probably wanted to reserve them for later use. Otherwise they would have had to drop instructions somewhere else, or adopt a more complicated encoding scheme, which would have meant spending more transistors on the decoder.
There remains the question of why they chose cl in particular, instead of, say, bl. They did have a general philosophy of using cx as a "count" register, and the other instructions that hardcode cx tend to use it as some sort of count: loop and rep, for instance. They might have thought that would make it easier for programmers to remember.
As noted by fuz, the much later BMI2 extension did add shlx/shrx/sarx, which can take their shift count from an arbitrary register, but they had to be shoehorned into an odd corner of the instruction space and have a much more complicated encoding, needing a VEX prefix and a total of 5 bytes or more to encode.
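For comparison, here is a rough Python sketch of how the register-only 32-bit forms of these BMI2 instructions are laid out, based on my reading of the VEX encoding in Intel's manual (VEX.LZ with map 0F38, opcode F7, and the count register in the inverted vvvv field). The function name `encode_bmi2_shift` is hypothetical, and the sketch assumes only eax..edi are used, with no memory operands and no 64-bit operand size; check the bytes against a real assembler before relying on them.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# The 2-bit pp field stands in for a mandatory prefix and selects the operation:
# 01 (66) -> shlx, 10 (F3) -> sarx, 11 (F2) -> shrx.
PP = {"shlx": 0b01, "sarx": 0b10, "shrx": 0b11}


def encode_bmi2_shift(mnemonic, dest, src, count):
    """Encode e.g. 'shlx dest, src, count' for 32-bit registers (sketch only)."""
    d, s, c = (REG32.index(r) for r in (dest, src, count))
    # Second VEX byte: inverted R, X, B bits (all 1, no extended registers)
    # followed by the 5-bit opcode map selector 00010 (the 0F38 map).
    byte1 = 0b11100000 | 0b00010
    # Third VEX byte: W=0, vvvv = inverted count register number, L=0, then pp.
    byte2 = ((~c & 0xF) << 3) | PP[mnemonic]
    # Ordinary Mod-Reg-R/M byte: Reg is the destination, R/M is the source.
    modrm = (0b11 << 6) | (d << 3) | s
    return bytes([0xC4, byte1, byte2, 0xF7, modrm])


print(encode_bmi2_shift("shlx", "eax", "ebx", "ecx").hex(" "))
```

Under those assumptions the result is five bytes for a register-to-register shift, against the two bytes (D3 E3) of shl bx, cl: the count register had to go into the VEX prefix because the Mod-Reg-R/M byte has only two operand fields.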