Early-clobbers and named registers

Question

I'm trying to understand the usage of "early-clobber outputs", but I stumbled upon a snippet which confuses me. Consider the following multiply-modulo function:

static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t d;
    uint64_t unused;
    asm ("mulq %3\n\t"
         "divq %4"
         :"=a"(unused), "=&d"(d)
         :"a"(a), "rm"(b), "rm"(n)
         :"cc");
    return d;
}

Why does RDX have the early-clobber flag (&)? Is it because mulq implicitly modifies RDX? Would the example work without the flag? (I tried and it seems it does. But would it be correct as well?) On the other hand, isn't it enough that the function outputs RDX to tell the compiler RDX was modified?

Also, why is there that unused variable? I assume it's there to denote that RAX was modified, correct? Can I remove it? (I tried and it seems to work.) I would have expected the correct way of marking the modified RAX to be adding "rax" to the clobbers, along with "cc". But that does not work.

Yes, it's because `mul` modifies `rdx`. Would it work otherwise? Only by accident. If the compiler decided to allocate `%4` to `rdx` it would be broken. The `unused` is there because certain versions of `gcc` don't allow clobbering an input register. – Jester, Mar 19 '20 at 19:36
@Jester "The `unused` is there because certain versions of `gcc` don't allow clobbering an input register." I assume you meant that to be the reason why I cannot add `"rax"` to clobbers, correct? So if I removed "unused" compiler would think that `RAX` was not modified and perhaps used it later thinking it still contains the original value? — Ecir Hana, Mar 19 '20 at 19:44
No versions of GCC allow clobbering input registers, AFAIK. Unless ancient versions like gcc2 or gcc3 did? But yes, you can tell the compiler that an input is modified by using a dummy output with the same register (or a `"0"` matching constraint for `"=r"` in cases where you can let the compiler choose a reg). Or in a wrapper function it's often convenient to just use an input/output operand like `"+a"(a)`. — Peter Cordes, Mar 19 '20 at 20:08
Yeah it's actually `clang` and only older than version 4.0.0 according to godbolt. I just remembered seeing code that used a clobber so I knew there was **some** compiler that accepted it. (godbolt doesn't have gcc3 or earlier so can't test that.) — Jester, Mar 19 '20 at 23:33
One problem here is that you are mixing explicit register names with 'wildcard' registers (via `rm`, etc). i.e., `"rm"(b)` could, in theory, choose a general purpose register, like `%rax` or `%rdx`. Given the restrictions on `divq`, you should express all `asm` registers explicitly. Also `=&d` only applies if you potentially clobber `rdx` after loading a value to that register. – Brett Hale, Mar 20 '20 at 06:31
@BrettHale Yes, the mixing is what confused me. But I'm not sure what you are saying in the following; do you mean the above code might not work? "you should express all `asm` registers explicitly" - how to express `RDX` as it gets modified by the first instruction? "Also =&d only applies if you potentially clobber `rdx` after loading a value to that register" - so `=&d` is not sufficient? Sorry, I don't quite understand. – Ecir Hana, Mar 20 '20 at 06:39
@BrettHale: The early clobber on RDX is appropriate: it stops the compiler from picking RDX for the `"rm"` divisor because it's not read until after `mul` writes RDX. (Div will also write RDX.) That will also prevent GCC from letting the multiplier pick RDX, unfortunately. And BTW, the missing early clobber on `"=a"` *will* let the compiler pick RAX as the divisor, in case it knows that `n = a` (even if it doesn't have a compile-time constant value) — Peter Cordes, Mar 20 '20 at 08:07
@PeterCordes - you're right; it's Friday night, and I'm not firing on all cylinders. I should save questions and provide answers tomorrow :) — Brett Hale, Mar 20 '20 at 08:24
@BrettHale: you didn't need to delete your answer here; I think it's a nice addition to the comments. — Peter Cordes, Mar 24 '20 at 00:30
@PeterCordes - thanks. It doesn't directly address the question, but I think it might help answer a bunch of logical 'follow-up' questions. – Brett Hale, Mar 24 '20 at 09:21
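
Putting the comments above together, a minimal sketch of the single-statement version might look like the following. It assumes x86-64 and GNU C inline asm (GCC or clang). The function name `mulmod64_ec` and the named operands `[b]` / `[n]` are illustrative choices, not from the original post. It uses the `"+a"` read/write form suggested in the comments and marks both RAX and RDX as early-clobber, so neither `b` nor `n` can be allocated to a register that `mulq` overwrites before `divq` reads its operand.

```c
#include <stdint.h>

/* Sketch only: x86-64, GNU C inline asm.
 * Assumed precondition (not checked here): the high 64 bits of a*b are
 * smaller than n, otherwise divq raises a divide error. */
static inline uint64_t mulmod64_ec(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t rem;
    __asm__ ("mulq %[b]\n\t"   /* RDX:RAX = RAX * b */
             "divq %[n]"       /* RAX = quotient, RDX = remainder */
             : "+&a" (a),      /* RAX: read by mulq, overwritten before n is read */
               "=&d" (rem)     /* RDX: written by mulq before n is read */
             : [b] "rm" (b), [n] "rm" (n)
             : "cc");
    return rem;
}
```

The cost of the early-clobber on RAX is that the compiler can no longer share RAX between `a` and `b` when it knows they are equal; it has to copy one of them first.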

Brett Hale · Answer 1 · 2020-03-23T10:33:35.423

1

While this doesn't answer the question (I think the comments have it covered), I would simplify this by letting the compiler choose registers vs. memory, and allowing it to schedule mulq and divq as required. The remaining problem is that div has register restrictions:

static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t ret, q, rh, rl;

    __asm__ ("mulq %3" : "=a,a" (rl), "=d,d" (rh)
             : "%0,0" (a), "r,m" (b) : "cc");

    /* assert(rh < n), otherwise `div` raises a 'divide error' - the quotient is
     * too large to store in `%rax`. */

    /* the `%` in "%0,0" marks `(a)` and the following operand `(b)` as commutative.
     * the "cc" clobber is implicit in gcc / clang asm (and, I expect, Intel icc)
     * for the x86-64 asm statements. */

    __asm__ ("divq %4" : "=a,a" (q), "=d,d" (ret)
             : "0,0" (rl), "1,1" (rh), "r,m" (n) : "cc");

    return ret;
}
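
One way to sanity-check an asm variant and its `rh < n` precondition is a plain C reference. This is a hedged sketch: it assumes a compiler with `unsigned __int128` (GCC and clang provide it on x86-64), and the names `mulmod64_ref` and `mulmod64_would_fault` are hypothetical helpers, not part of the answer.

```c
#include <stdint.h>

/* Hypothetical reference: full 128-bit product, then remainder.
 * n must be nonzero (division by zero is undefined behaviour in C). */
static inline uint64_t mulmod64_ref(uint64_t a, uint64_t b, uint64_t n)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product % n);
}

/* Hypothetical check: nonzero when `divq` would raise a divide error,
 * i.e. when the high half of a*b is not smaller than n.
 * This also reports n == 0, since no high half is below zero. */
static inline int mulmod64_would_fault(uint64_t a, uint64_t b, uint64_t n)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64) >= n;
}
```

Comparing the asm version against `mulmod64_ref` for inputs where `mulmod64_would_fault` returns 0 gives a quick correctness test; when `a < n` or `b < n` the high half is always below `n`, so the usual modular-arithmetic callers never hit the fault.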

edited Mar 23 '20 at 10:33

answered Mar 20 '20 at 07:48

Brett Hale


clang tends to do a bad job, and always chooses memory when it's an option. Although I think multi-alternative constraints do avoid that, with clang IIRC ignoring later alternatives. So this might actually compile ok on clang because you used `r,m` and the same option twice for everything else; I suppose clang was your reason for choosing that? – Peter Cordes Mar 20 '20 at 08:02
@PeterCordes - yeah, I've been aware of this for a while... https://stackoverflow.com/questions/16850309/clang-llvm-inline-assembly-multiple-constraints-with-useless-spills-reload ... very unfortunate... – Brett Hale Mar 20 '20 at 08:10
I've been advised that this at least adds a supplement to the answer. It's possible that separate `asm` statements may allow the compiler to interleave instructions that don't affect the `mul / div` pair, possibly hiding behind some of the horrible latency of `divq`. – Brett Hale Mar 24 '20 at 09:19
Besides latency, `divq` on Intel SnB-family is microcoded as over 50 uops (many more than `div r/32`), so it ties up the front-end for a while as well. Probably there's not much difference in scheduling an independent instruction ahead of the mul vs. between it and the div, though. `mul` is only a couple uops and 3c latency on most modern x86-64 so the shadow of `mul` separate from `div` is not too large. – Peter Cordes Mar 24 '20 at 09:28
@PeterCordes - I don't follow this stuff in anywhere near the detail you do. That said, is it true that we can expect to see a massive improvement in integer division with Cannon Lake? – Brett Hale Mar 24 '20 at 10:04
https://www.uops.info/table.html has actual test results from SKL and Ice Lake. Yes, 64-bit division is much faster now, about 10 cycle throughput instead of 21. And vastly better latency. (14 vs. >=35 for `div r64`). (Intel basically aborted Cannon Lake and skipped to Ice Lake, the new uarch on 10nm. uops.info does have test results from one of the rare CNL laptop CPUs that did get sold, though: same as ICL.) ICL also separates the integer and FP dividers so they can work in parallel; I didn't know it was going to speed up div r64, thanks for the tip. It's also only 4 uops, not microcoded – Peter Cordes Mar 24 '20 at 10:51
