x86 sbb with same register as first and second operand

Question

I am analyzing a sequence of x86 instructions and am confused by the following code:

135328495: sbb edx, edx
135328497: neg edx
135328499: test edx, edx
135328503: jz 0x810f31c

I understand that sbb computes dst = dst - (src + CF); in other words, the first instruction somehow puts -CF into edx. Then it negates -CF into CF, and tests whether CF equals zero??

But note that jz checks flag ZF, not CF! So basically what is the above code sequence trying to do? This is a legal x86 instruction sequence, produced by g++ version 4.6.3.

The C++ code is actually from the Botan project. You can find the overall assembly code (the Botan RSA decryption example) here. There are quite a lot of such instruction sequences in the disassembled code.

What source did this come from? What compile options? (Optimization enabled? Any `-mtune=bdver2` or something? AMD CPUs recognize `sbb same,same` as independent of the old value of the reg, and depending only on CF. On an Intel CPU, it would probably be faster to `setc dl` / `movzx edx, dl` (or better, [have edx zeroed before running setc, preferably with a recognized zeroing idiom to avoid a partial-register stall](http://stackoverflow.com/a/33668295/224132)), so I'm surprised g++ would use this sequence for anything.) It's weird that it uses TEST, since NEG already sets flags. — Peter Cordes, Dec 08 '16 at 04:55
Does anything use EDX after that? It must, or else gcc would have just used JC, right? Maybe the source was such a mess that gcc didn't manage to optimize it? — Peter Cordes, Dec 08 '16 at 05:01
The `sbb`/`neg` thing is a very common MSVC idiom to generate branchless code, but I don't normally see GCC using it; it tends to prefer `setc` as Peter suggested. Either way, though, since this is not branchless code, the contortions to avoid a `jc` don't make much sense. It also looks to me like the `test` instruction is redundant. You said the code came from GCC 4.6.3, but where did the *input* code come from? Do you have the original C or C++ source code, or are you decompiling an opaque binary? — Cody Gray - on strike, Dec 08 '16 at 11:12
I would love to see how `g++` does produce this and when, because that code doesn't look very good, while the g++ with optimizations on usually does produce much better stuff. This looks more like some artificially built sequence. — Ped7g, Dec 08 '16 at 11:20
except for using a big integer type, I can't think of a case when a C compiler emits code that uses CF — phuclv, Dec 08 '16 at 11:52
@LưuVĩnhPhúc: You mean other than unsigned and FP comparisons? — Peter Cordes, Dec 08 '16 at 15:39
@PeterCordes I can get GCC 4.6.4 to generate this sequence with a somewhat contrived example: https://godbolt.org/g/K2rUPv — Ross Ridge, Dec 08 '16 at 19:54
@RossRidge: ahh, having a branch target after the NEG probably explains it. (\@OP: Agner Fog's objconv puts branch targets in disassembly output). gcc switched idioms from `cmp eax,1/sbb/neg` to `xor/test/setz` with gcc4.7. (@Cody [points out that setcc was slow on old CPUs for some strange reason](http://stackoverflow.com/questions/41031912/x86-sbb-with-same-register-as-first-and-second-operand?noredirect=1#comment69294883_41038556)), so it does make sense for `-O3 -mtune=generic` on such an old compiler. On CPUs since k7 and core2, SBB/NEG has no advantage except maybe code-size. — Peter Cordes, Dec 08 '16 at 20:10
@PeterCordes I don't think GCC ever used the SBB/NEG idiom; the NEG is only there because of an explicit negation in the code (making it somewhat contrived). GCC just used SBB to calculate `e = a > b ? 0 : -1` and then NEG to calculate `e = -e`. If you simplify it to `e = a > b ? 0 : 1` or just `e = a > b` then GCC 4.6.4 uses a SETcc instruction. BTW, modern versions of GCC will still use SBB in certain cases like `a > b ? 1 : 10`. — Ross Ridge, Dec 08 '16 at 20:38
@RossRidge: ah I see, that is pretty contrived. Thanks for clearing that up. — Peter Cordes, Dec 08 '16 at 21:05
@PeterCordes, thank you for your very informative reply. I updated the question for your reference. — lllllllllllll, Dec 09 '16 at 19:52
@RossRidge, many thanks for your example and your reply. I updated the question for your information. — lllllllllllll, Dec 09 '16 at 19:53
@LưuVĩnhPhúc: The assembly code is produced by compiling the `Botan` crypto library. They basically maintain the numbers as `big int`. I updated the question for your information. — lllllllllllll, Dec 09 '16 at 19:54
If that's some inner crypto loop used to check cypher, then this is a bit insecure. The branching will produce variations in EMF noise signal from CPU. Crypto loops should be written in fixed-code-executed way whether the provided value is valid or invalid key, so the EMF noise doesn't give attacker hint in case of brute-force attack, which bit in key was first incorrect one. — Ped7g, Dec 09 '16 at 20:16
Hi @Ped7g, could you please elaborate on `EMF noise signal` and how it can be used in an attack? I quickly googled this but could not find any useful information. — lllllllllllll, Dec 09 '16 at 21:04
I think I had in mind the acoustic article [here](https://www.extremetech.com/extreme/173108-researchers-crack-the-worlds-toughest-encryption-by-listening-to-the-tiny-sounds-made-by-your-computers-cpu) (I just didn't recall the details properly), but I have now also found an [EM one](https://www.schneier.com/blog/archives/2016/02/practical_tempe.html). | How it can be used... you simply start brute-forcing the key by requesting the target computer to decrypt your request. When it hits a wrong bit in the key, the code shows a small variation, so you measure which bit is wrong first. Flip that one. Send a new request. — Ped7g, Dec 09 '16 at 22:25
The real attack is a lot more complicated; I think you need a proper key recording to compare against to detect deviations, etc. I didn't study it in depth; I'm not a security person, I prefer just to write code. :) But I hope my description works as a rough principle. — Ped7g, Dec 09 '16 at 22:30

score 9 · Accepted Answer · edited May 23 '17 at 12:01

sbb edx, edx

Your analysis of this instruction is correct. SBB means "subtract with borrow". It subtracts the source from the destination in a way that takes the carry flag (CF) into account.

As such, it is equivalent to dst = dst - (src + CF), so this is edx = edx - (edx + CF), or simply edx = -CF.
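The arithmetic above can be checked with a small model. This is only a sketch in Python (not x86 code); the helper names `sbb32` and `as_signed32` are made up for illustration, and the model covers the result value only, not the flags.

```python
MASK32 = 0xFFFFFFFF

def sbb32(dst, src, cf):
    """Model the result of a 32-bit SBB: (dst - (src + CF)) mod 2**32."""
    return (dst - (src + cf)) & MASK32

def as_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value

# Whatever EDX held before, `sbb edx, edx` leaves 0 or -1 depending only on CF.
for old_edx in (0, 1, 0x12345678, 0xFFFFFFFF):
    for cf in (0, 1):
        edx = sbb32(old_edx, old_edx, cf)
        print(f"old edx={old_edx:#010x} CF={cf} -> edx={edx:#010x} ({as_signed32(edx)})")
```

The old value of edx cancels out in every row, which is why the result depends only on CF.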

Don't let it fool you that the source and destination operands are both edx here! SBB same, same is a pretty common idiom in compiler-generated code to isolate the carry flag (CF), especially when they are attempting to generate branchless code. There are alternative ways of doing this, namely the SETC instruction, which is probably faster on most x86 architectures (see comments for a more thorough dissection), but not by a significant amount. Compilers from different vendors (and possibly even different versions) tend to have a preference for one or the other, and use that everywhere, when you're not doing architecture-specific tuning.

neg edx

Again, your analysis of this instruction is correct. It's a pretty simple one. NEG performs a two's-complement negation on its operand. Therefore, this is just edx = -edx.

In this case, we know that edx originally contained -CF, which means that its initial value was either 0 or -1 (because CF is always either 0 or 1, on or off). Negating it means that edx now contains either 0 or 1.

That is, if CF was originally set, edx will now contain 1; otherwise, it will contain 0. This is really the completion of the idiom discussed above; you need the NEG to fully isolate the value of CF.
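As a rough illustration of this step, the following Python sketch models NEG on a 32-bit value. The helper `neg32` and the `after_sbb` table are hypothetical names used only for this example.

```python
MASK32 = 0xFFFFFFFF

def neg32(value):
    """Model the result of a 32-bit NEG: two's-complement negation, wrapped to 32 bits."""
    return (-value) & MASK32

# After `sbb edx, edx`, EDX holds 0 (CF was clear) or 0xFFFFFFFF (CF was set).
after_sbb = {0: 0x00000000, 1: 0xFFFFFFFF}

for cf, edx in after_sbb.items():
    print(f"original CF={cf}: edx before neg={edx:#010x}, after neg={neg32(edx)}")
```

In both cases the value left in edx after the negation equals the original CF.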

test edx, edx

The TEST instruction is the same as the AND instruction, except that it does not affect the destination operand—it only sets flags.

But this is another special case. TEST same, same is a standard idiom in optimized code to efficiently determine if the value in a register is 0. You could write CMP edx, 0, which is what a human programmer would naïvely do, but TEST has a shorter encoding (it needs no immediate operand) and is never slower. (Why does this work? Because of the truth table for bitwise AND: value & value is just value, so the result is 0 only when value is 0.)
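The equivalence of the two zero checks can be sketched in Python as follows. The function names are invented for this illustration, and only ZF is modelled.

```python
def zf_after_test_same(value):
    """ZF produced by `test reg, reg`: set iff (value & value) == 0."""
    return int((value & value) == 0)

def zf_after_cmp_zero(value):
    """ZF produced by `cmp reg, 0`: set iff value - 0 == 0."""
    return int((value - 0) == 0)

samples = (0, 1, 2, 0x80000000, 0xFFFFFFFF)
for value in samples:
    # Both forms agree on ZF for every sample value.
    assert zf_after_test_same(value) == zf_after_cmp_zero(value)
    print(f"value={value:#010x} ZF={zf_after_test_same(value)}")
```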

So this has the effect of setting flags. Specifically, it sets the zero flag (ZF) if edx is 0, and clears it if edx is non-zero.

Therefore, if CF was originally set, ZF will now be clear; otherwise, it will be set. Perhaps a simpler way of looking at it is that these three instructions set ZF to the opposite of the original value of CF.

Here are the two possible data flows:

CF == 0 → edx = 0 → edx = 0 → ZF = 1
CF == 1 → edx = -1 → edx = 1 → ZF = 0

jz 0x810f31c

Finally, this is a conditional jump based on the value of ZF. If ZF is set, it jumps to 0x810f31c; otherwise, it falls through to the next instruction.

Putting everything together, then, this code tests the complement of the carry flag (CF) via an indirect route that involves the zero flag (ZF). It branches if the carry flag was originally clear, and falls through if the carry flag was originally set.
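To make the overall effect concrete, here is a minimal Python sketch that steps through the four instructions and compares the outcome with a plain "jump if not carry". The function names and the arbitrary starting value of edx are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF

def branch_taken_via_sbb_neg_test(cf, edx=0xDEADBEEF):
    """Step through sbb/neg/test/jz and report whether the jz is taken."""
    edx = (edx - (edx + cf)) & MASK32   # sbb edx, edx  -> 0 or 0xFFFFFFFF
    edx = (-edx) & MASK32               # neg edx       -> 0 or 1
    zf = int((edx & edx) == 0)          # test edx, edx -> ZF = (edx == 0)
    return zf == 1                      # jz: taken when ZF is set

def branch_taken_via_jnc(cf):
    """A single `jnc` is taken when CF is clear."""
    return cf == 0

for cf in (0, 1):
    print(cf, branch_taken_via_sbb_neg_test(cf), branch_taken_via_jnc(cf))
```

The two agree on the branch decision. The one difference the sketch does not capture is that the longer sequence also leaves the original CF value in edx, which matters only if later code reads it.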

That's how it works. That said, I cannot explain why the compiler chose to generate the code this way. It appears to be sub-optimal on a number of levels. Most obviously, the compiler could have simply emitted a JNC instruction (jump if not carry). Although Peter Cordes and I have made various other observations and speculations in comments, I don't think it makes sense to incorporate all of this into an answer unless a bit more information can be provided about the origin of this code.

`sbb edx,edx` / `neg` is clearly worse than `setc` into an xor-zeroed register [(last part of this answer)](http://stackoverflow.com/a/33668295/224132). SBB has a false dependency on EDX on everything except AMD, and SBB is 2 uops on Intel pre-Broadwell. `sbb same,same` can be a useful idiom when you want all-ones (i.e. -1), but using NEG drags it down to be at best equal. It's a small difference, but I wouldn't say it's a toss-up, since there are no CPUs where SBB/NEG is better and many where it's worse. Note that even on AMD, both SBB and NEG are on the critical path, unlike XOR/SETC. — Peter Cordes, Dec 08 '16 at 15:46
I think the OP meant that NEG changes the `-CF` value in EDX into the original value of CF, still in EDX. Not that it sets CF in EFLAGS. So I think the OP understood it correctly, but phrased it ambiguously. — Peter Cordes, Dec 08 '16 at 15:49
@peter Hmm, I distinctly remembering both reading and confirming for myself in benchmarks that `xor`/`setc` was slower than `sbb`/`neg`. And this wasn't for AMD processors, it was definitely Intel. It was almost certainly old; the original source may have been talking about PII, and my tests were on PIII and P4 (and P4 is a strange bird indeed). The fact that MSVC has generated this code since forever also strongly suggests that it was preferable at one time, probably on the 486 or Pentium where they first made this choice. Intel used to caution that `setCC` was slow and to be avoided. — Cody Gray - on strike, Dec 08 '16 at 15:55
`setCC` was also 2 uops on PIII and P4, so the only issue with `sbb` would have been the false dependency. (Intel also points out that `setCC` is high latency, but I believe that wouldn't matter here, since it's not on the critical path.) I'd have to profile this again to be sure. You can't really trust the uops to tell the whole story! But thanks for pointing that out. It may be that my knowledge is out of date and needs to be revisited. (Also, I knew what the OP meant. I didn't say he was wrong in his analysis, I said he was right!) — Cody Gray - on strike, Dec 08 '16 at 15:58
Huh, I hadn't looked at SETCC on old CPUs. It's 1 uop / 1c latency on everything that's still relevant. Agner has it at 1c latency as far back as Merom, and 1 uop with unlisted latency before that (and slow on P4). On AMD, it's 1m-op / 1c back to K7, with high throughput. I was just saying 2 uops as a short-hand for 2 dependent uops with 2c latency, since comment space is limited. But anyway, I guess that explains MSVC's choice of idiom; thanks for the history lesson! — Peter Cordes, Dec 08 '16 at 16:06
Ohh, I think I just figured out why MSVC uses `sbb`—pairing. This optimization decision was made back when explicit pairing was critical, and while `sbb` pairs only in the U pipe, it can pair easily with `neg`. `setCC` never pairs, because it has an `0Fh` prefix. And that prefix is probably also why Intel (used to?) caution against using it, because decoding prefixed instructions is/was slower than non-prefixed instructions. Anyway, I have no earthly idea how you remember all of this stuff so easily. I am incredibly jealous, @peter. — Cody Gray - on strike, Dec 08 '16 at 16:08
I have Agner Fog's spreadsheet open in Libreoffice basically all the time. I had to tab over to it about 5 times while writing the previous comment to see when Agner first listed a latency for SETCC on a P6 core, and to check on AMD. P5 pairing makes sense, too, though. But can a dependent NEG pair with the SBB that produces its input? More likely SBB can pair with a previous insn, and NEG with the next insn. — Peter Cordes, Dec 08 '16 at 16:08

score 4 · Answer 2 · answered Dec 08 '16 at 11:15

I understand that sbb computes dst = dst - (src + CF); in other words, the first instruction somehow puts -CF into edx.

Yes, edx = edx - (edx + CF) = -CF. So sbb edx,edx sets edx to 0 when CF=0, and to -1 (0xFFFFFFFF) when CF=1. The subtraction itself also produces a new CF value, which, if I'm not mistaken, equals the old one.

Then it negates -CF into CF, and tests whether CF equals zero??

Almost, but not quite. It negates edx, not CF. To complement CF there is a separate instruction, CMC (from the stc/clc/cmc family of carry-flag instructions).

So edx changes from 0/-1 to 0/1, and CF is again set to 0/1 (wow, I didn't know neg sets CF to the inverse of ZF). The neg also already sets ZF, so the following test edx,edx is redundant.

test edx,edx doesn't test CF but edx (at this point 0 or 1); it produces CF=0, and ZF=1 or 0 for the value 0 or 1 respectively.

So your reasoning went astray because you held on to the fact that the numeric value in edx originated from CF and kept thinking about CF. In fact, after the first sbb you can forget about the old CF: each subsequent instruction (and the sbb itself) is arithmetic, so it modifies CF in its own way. The neg and test instructions operate on edx, the number in the register; CF is just a by-product of their calculation.

But note that jz checks flag ZF, not CF!

Indeed: CF contains 0 after the final test, regardless of the initial CF value before the sbb. ZF, on the other hand, is directly tied to the original CF value: if the code started with CF=1, the final jz is not taken (ZF=0), and if it started with CF=0, the jz is taken (ZF=1).
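The flag behaviour described in this answer can be traced with a small model. This is a simplified Python sketch that tracks only CF and ZF; the helper names `x86_sbb`, `x86_neg`, `x86_test` and `trace` are invented for the illustration.

```python
MASK32 = 0xFFFFFFFF

def x86_sbb(dst, src, cf):
    """SBB: result plus the CF (borrow out) and ZF it produces."""
    full = dst - (src + cf)
    result = full & MASK32
    return result, {"CF": int(full < 0), "ZF": int(result == 0)}

def x86_neg(value):
    """NEG: CF is set unless the operand was 0; ZF reflects the result."""
    result = (-value) & MASK32
    return result, {"CF": int(value != 0), "ZF": int(result == 0)}

def x86_test(a, b):
    """TEST: bitwise AND for flags only; CF is always cleared."""
    return {"CF": 0, "ZF": int((a & b) == 0)}

def trace(cf, edx=0x12345678):
    """Return (mnemonic, edx, flags) after each of sbb, neg and test."""
    steps = []
    edx, flags = x86_sbb(edx, edx, cf)
    steps.append(("sbb", edx, flags))
    edx, flags = x86_neg(edx)
    steps.append(("neg", edx, flags))
    steps.append(("test", edx, x86_test(edx, edx)))
    return steps

for cf in (0, 1):
    for name, edx, flags in trace(cf):
        print(f"start CF={cf}: after {name:4} edx={edx:#010x} CF={flags['CF']} ZF={flags['ZF']}")
```

In this model ZF is already correct after the neg, so the test changes nothing that the jz depends on; it only clears CF.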

I think the OP was talking about the value in EDX, and just labelling it `-CF` and then `CF`, meaning the starting value of CF. Or maybe that *jz checks flag ZF, not CF* is a clue that they didn't follow it correctly. — Peter Cordes, Dec 08 '16 at 15:52
@PeterCordes many thanks for your example and your reply. I updated the question for your information. — lllllllllllll, Dec 09 '16 at 19:54
