This is a comment on and elaboration of @emacs drives me nuts' answer, but it is too long and complex for a Stack Overflow comment, so it is posted as a Community Wiki answer.
I have compiled 8-, 16-, 24-, and 32-bit versions on godbolt.org with two versions of avr-gcc (12.2.0 and 5.4.0), using the same compile options for both (-std=c99 -O2 -g -mmcu=atmega8515). It turns out the code produced by avr-gcc 12.2.0 is significantly shorter than what avr-gcc 5.4.0 produces.
As @emacs drives me nuts' answer works off the relatively complicated code from avr-gcc 5.4.0, it may be useful to take a look at the simpler code from 12.2.0 as well.
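For orientation, the C source behind the listings presumably looks something like the sketch below. This is an assumption: the function names are taken from the assembly labels, but the bodies and the `uint8_t` return type are a guess at the obvious remainder test, not a copy of the compiled source.

```c
#include <stdint.h>

/* Assumed C source: names come from the assembly labels,
   bodies are a guess at the plain remainder test. */
uint8_t is_udiv8_by_3(uint8_t n)   { return n % 3 == 0; }
uint8_t is_udiv16_by_3(uint16_t n) { return n % 3 == 0; }
uint8_t is_udiv32_by_3(uint32_t n) { return n % 3 == 0; }

#ifdef __AVR__
/* __uint24 is an avr-gcc extension, so this variant only builds for AVR. */
uint8_t is_udiv24_by_3(__uint24 n) { return n % 3 == 0; }
#endif
```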
The uint8_t variant from avr-gcc 5.4.0 appears to be what @emacs drives me nuts' answer works on:
is_udiv8_by_3:
ldi r25,lo8(-85) /* 1 */
mul r24,r25 /* 2 */
mov r25,r1 /* 1 */
clr __zero_reg__ /* 1 */
lsr r25 /* 1 */
mov r18,r25 /* 1 */
lsl r18 /* 1 */
add r25,r18 /* 1 */
ldi r18,lo8(1) /* 1 */
cpse r24,r25 /* 1 */
ldi r18,0 /* 1 */
mov r24,r18 /* 1 */
ret /* 13 cycles plus ret */
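Read as C, the 5.4.0 listing appears to compute the quotient n / 3 with a multiply and shift, multiply it back by 3, and compare the result with n. The following is a rough host-side model of that reading, not compiler output; the names `is_div3_u8_quotient` and `check_u8_quotient` are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the avr-gcc 5.4.0 sequence: q = n / 3, then test q * 3 == n. */
static uint8_t is_div3_u8_quotient(uint8_t n)
{
    uint16_t prod = (uint16_t)(n * 171u);      /* mul r24,r25 with 0xAB = 171 */
    uint8_t q = (uint8_t)(prod >> 8);          /* mov r25,r1: high byte */
    q >>= 1;                                   /* lsr r25: q == n / 3 */
    uint8_t q3 = (uint8_t)(q + (q << 1));      /* lsl + add: q * 3 */
    return n == q3;                            /* cpse r24,r25 */
}

/* Exhaustive comparison against the plain remainder test. */
static void check_u8_quotient(void)
{
    for (unsigned n = 0; n < 256; n++)
        assert(is_div3_u8_quotient((uint8_t)n) == (n % 3 == 0));
}
```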
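The uint8_t variant from avr-gcc 12.2.0 needs neither the shift nor the multiply-back and just compares the low byte of the product against a constant: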
is_udiv8_by_3:
ldi r25,lo8(-85) /* 1 */
mul r24,r25 /* 2 */
mov r25,r0 /* 1 */
clr r1 /* 1 */
ldi r24,lo8(1) /* 1 */
cpi r25,lo8(86) /* 1 */
brlo .L2 /* 2/1 */
ldi r24,0 /* -/1 */
.L2:
ret /* 9 cycles plus ret */
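The 12.2.0 listing seems to rely on 171 being the multiplicative inverse of 3 modulo 256 (3 * 171 = 513 = 2 * 256 + 1). Multiplying by 171 modulo 256 therefore maps the multiples of 3 (0, 3, ..., 255) onto 0..85, and every other value onto something larger. A minimal C model of that reading, with made-up function names:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the avr-gcc 12.2.0 sequence: low byte of n * 171 below 86. */
static uint8_t is_div3_u8_inverse(uint8_t n)
{
    uint8_t low = (uint8_t)(n * 171u);  /* mul, then mov r25,r0: low byte */
    return low < 86;                    /* cpi r25,lo8(86) ; brlo */
}

/* Exhaustive comparison against the plain remainder test. */
static void check_u8_inverse(void)
{
    for (unsigned n = 0; n < 256; n++)
        assert(is_div3_u8_inverse((uint8_t)n) == (n % 3 == 0));
}
```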
The uint16_t variant from avr-gcc 12.2.0 is not significantly longer than the uint8_t variant from avr-gcc 5.4.0:
is_udiv16_by_3:
ldi r20,lo8(-85) /* 1 */
ldi r21,lo8(-86) /* 1 */
mul r24,r20 /* 2 */
movw r18,r0 /* 1 */
mul r24,r21 /* 2 */
add r19,r0 /* 1 */
mul r25,r20 /* 2 */
add r19,r0 /* 1 */
clr r1 /* 1 */
ldi r24,lo8(1) /* 1 */
cpi r18,86 /* 1 */
sbci r19,85 /* 1 */
brlo .L5 /* 2/1 */
ldi r24,0 /* -/1 */
.L5:
ret /* 17 cycles plus ret */
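The 16-bit listing looks like the same inverse trick one size up: 0xAAAB is the inverse of 3 modulo 65536 (3 * 0xAAAB = 0x20001), the three mul instructions build the low 16 bits of the product, and cpi/sbci compare that against 0x5556. A hedged C model, again with hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the 16-bit sequence: low 16 bits of n * 0xAAAB below 0x5556. */
static uint8_t is_div3_u16_inverse(uint16_t n)
{
    uint16_t low = (uint16_t)((uint32_t)n * 0xAAABu); /* three 8x8 muls on AVR */
    return low < 0x5556u;                             /* cpi r18,86 ; sbci r19,85 */
}

/* Exhaustive comparison against the plain remainder test. */
static void check_u16_inverse(void)
{
    for (uint32_t n = 0; n <= 0xFFFFu; n++)
        assert(is_div3_u16_inverse((uint16_t)n) == (n % 3 == 0));
}
```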
BTW, the cycle count of the is_div3_asm function from @emacs drives me nuts' answer depends on the input value. It takes more than 18 cycles, and thus more than the uint16_t variant above, even for a single iteration of the .Loop_bits loop.
When increasing the type size to 24 bits (the __uint24 variant) and 32 bits, avr-gcc 12.2.0 finally starts calling a division function. For 24 bits, that is __udivmodpsi4:
is_udiv24_by_3:
ldi r18,lo8(3)
ldi r19,0
ldi r20,0
rcall __udivmodpsi4
ldi r24,lo8(1)
or r18,r19
or r18,r20
breq .L7
ldi r24,0
.L7:
ret
The uint32_t variant calls a different division function, __udivmodsi4, and is a lot longer as well:
is_udiv32_by_3:
push r28
push r29
rcall .
rcall .
in r28,__SP_L__
in r29,__SP_H__
ldi r18,lo8(3)
ldi r19,0
ldi r20,0
ldi r21,0
rcall __udivmodsi4
std Y+1,r22
std Y+2,r23
std Y+3,r24
std Y+4,r25
ldi r24,lo8(1)
ldd r18,Y+1
ldd r19,Y+2
ldd r20,Y+3
ldd r21,Y+4
or r18,r19
or r18,r20
or r18,r21
breq .L12
ldi r24,0
.L12:
pop __tmp_reg__
pop __tmp_reg__
pop __tmp_reg__
pop __tmp_reg__
pop r29
pop r28
ret
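In principle, the inverse trick from the 8- and 16-bit listings extends to 32 bits, with 0xAAAAAAAB as the inverse of 3 modulo 2^32 and 0x55555555 as the largest quotient. The sketch below only shows that the arithmetic works. It is not something avr-gcc emitted, the names are hypothetical, and whether a 32-bit multiply built from 8x8 mul instructions would beat the library call on an AVR has not been measured here.

```c
#include <assert.h>
#include <stdint.h>

/* Same inverse trick at 32 bits; the product wraps modulo 2^32. */
static uint8_t is_div3_u32_inverse(uint32_t n)
{
    uint32_t low = n * UINT32_C(0xAAAAAAAB);
    return low <= UINT32_C(0x55555555);
}

/* Sampled comparison against the remainder test (stride 65521 is arbitrary). */
static void check_u32_inverse(void)
{
    for (uint64_t n = 0; n <= UINT32_MAX; n += 65521)
        assert(is_div3_u32_inverse((uint32_t)n) == (n % 3 == 0));
}
```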
So looking for other algorithms, like the alternate cross sum algorithm from @emacs drives me nuts' answer or the "Fast divisibility tests (by 2,3,4,5,.., 16)?" question that @Peter Cordes links to, looks interesting only for integer sizes above 16 bits.
For those sizes, code size and loop cycle counts would need to be compared more carefully.
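As one possible illustration of a cross-sum style test for 32 bits: since 256 leaves remainder 1 when divided by 3, the sum of the four bytes of n has the same remainder modulo 3 as n itself. The sum is at most 1020, so it can be handed to the 16-bit inverse comparison. This is a sketch under those assumptions, with made-up names, and not the code from the other answer or from the linked question.

```c
#include <assert.h>
#include <stdint.h>

/* Byte sum (base-256 cross sum) followed by the 16-bit inverse test. */
static uint8_t is_div3_u32_bytesum(uint32_t n)
{
    uint16_t s = (uint16_t)((n & 0xFF) + ((n >> 8) & 0xFF)
                          + ((n >> 16) & 0xFF) + (n >> 24)); /* s <= 1020 */
    return (uint16_t)((uint32_t)s * 0xAAABu) < 0x5556u;
}

/* Sampled comparison against the remainder test (stride 65521 is arbitrary). */
static void check_u32_bytesum(void)
{
    for (uint64_t n = 0; n <= UINT32_MAX; n += 65521)
        assert(is_div3_u32_bytesum((uint32_t)n) == (n % 3 == 0));
}
```

On an AVR the byte sum would cost three additions with carry before the 16-bit test, so the total cycle count would still have to be checked against the library call.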