Subset of x86 without a %gs register: binary patching code that uses %gs instead of trapping to emulation?

Question

For reasons too complicated to explain here, I need to run an x86 GCC-compiled Linux program on a platform that is a subset of x86. This platform does not have the %gs register, so the register has to be emulated, because GCC-compiled code relies on the presence of %gs.

Currently I have a wrapper which catches the exceptions when the program attempts to access the %gs register, and emulates it. But this is dog slow. Is there a way that I can patch the opcodes in the ELF ahead of time with equivalent instructions, so that the trap-and-emulate is avoided?

score 4 · Answer 1 · answered Aug 02 '11 at 04:38


Have you tried compiling your code with the -mno-tls-direct-seg-refs option? From my GCC man page (i686-apple-darwin10-gcc-4.2.1):

   -mtls-direct-seg-refs
   -mno-tls-direct-seg-refs
       Controls whether TLS variables may be accessed with offsets from
       the TLS segment register (%gs for 32-bit, %fs for 64-bit), or
       whether the thread base pointer must be added.  Whether or not this
       is legal depends on the operating system, and whether it maps the
       segment to cover the entire TLS area.

       For systems that use GNU libc, the default is on.

answered Aug 02 '11 at 04:38

Adam Rosenfield


I thought it was FS for 32-bit and GS for 64-bit? – user541686 Aug 02 '11 at 04:39
@Merhdad: I think it's OS-specific. On Mac OS X v10.6, both 32- and 64-bit executables seem to use GS in a test I just did; Linux 2.6.32 also uses GS for 64-bit. – Adam Rosenfield Aug 02 '11 at 05:00
I was talking about Windows actually, sorry (see [here](http://en.wikipedia.org/wiki/X86-64#Windows)). No idea about the others. – user541686 Aug 02 '11 at 05:14
IIUC for Linux it's GS for 32-bit and FS for 64-bit, FS is reserved for Wine in 32-bit Linux. See http://stackoverflow.com/questions/6611346/amd64-fs-gs-registers-in-linux/6617004#6617004 – ninjalj Aug 04 '11 at 20:35

user786653 · Accepted Answer · 2011-08-02T13:10:02.287

(This assumes Adam Rosenfield's solution is not applicable. It, or a similar approach, is probably a better way to solve the problem.)

You haven't stated how you're emulating the %gs register, but it's probably going to be tough to patch every usage in general unless you have some special knowledge about the program, because otherwise you only have 2 bytes (in the worst, and common, case) that you can modify with your patch. Of course, if you're using something like %es = %gs, it should be relatively straightforward.

Assuming this can somehow be made to work in your case, the strategy is to scan the executable sections of the ELF file and patch any instruction that uses or modifies the GS register. That is at least the following instructions:

Any instruction with the GS segment override prefix (65, except on branch instructions, where the prefix indicates something else)
push gs (0F A8)
pop gs (0F A9)
mov r/m16, gs (8C /r)
mov gs, r/m16 (8E /r)
mov gs, r/m64 (REX.W 8E /r) (If you support 64-bit mode)

And any other instructions that allow segment registers (I don't think there are that many more, but I'm not 100% sure).
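As a rough illustration of the list above, here is a minimal Python sketch that checks whether a single instruction touches GS. It assumes the instruction boundaries have already been found by a real decoder, so `insn` holds exactly one instruction. The names `uses_gs`, `LEGACY_PREFIXES` and `GS_SREG` are made up for this example, and the Jcc exclusion simply follows the note about branch instructions above. It covers only the encodings listed here, not every instruction that can involve a segment register.

```python
# Hypothetical helper: classify ONE already-delimited x86 instruction.
LEGACY_PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65, 0x66, 0x67, 0xF0, 0xF2, 0xF3}
GS_OVERRIDE = 0x65
GS_SREG = 5  # segment register numbers in ModRM.reg: ES=0 CS=1 SS=2 DS=3 FS=4 GS=5


def uses_gs(insn: bytes, long_mode: bool = False) -> bool:
    """Return True if the instruction uses or modifies GS (listed encodings only)."""
    i = 0
    saw_gs_prefix = False
    # Skip legacy prefixes, remembering whether a GS override was among them.
    while i < len(insn) and insn[i] in LEGACY_PREFIXES:
        if insn[i] == GS_OVERRIDE:
            saw_gs_prefix = True
        i += 1
    # In 64-bit mode a REX prefix (40..4F) may sit right before the opcode.
    if long_mode and i < len(insn) and 0x40 <= insn[i] <= 0x4F:
        i += 1
    rest = insn[i:]
    if not rest:
        return False
    # push gs / pop gs
    if rest[:2] in (b"\x0f\xa8", b"\x0f\xa9"):
        return True
    # mov r/m, Sreg (8C /r) and mov Sreg, r/m (8E /r): Sreg is in ModRM bits 5..3
    if rest[0] in (0x8C, 0x8E) and len(rest) >= 2 and (rest[1] >> 3) & 7 == GS_SREG:
        return True
    if saw_gs_prefix:
        # Conditional branches (70..7F, 0F 80..8F) are excluded, per the note above.
        is_jcc = 0x70 <= rest[0] <= 0x7F or (
            rest[0] == 0x0F and len(rest) >= 2 and 0x80 <= rest[1] <= 0x8F
        )
        return not is_jcc
    return False
```

For example, `uses_gs(bytes.fromhex("65a114000000"))` (a load from `gs:0x14`) is flagged, while `uses_gs(bytes.fromhex("8cd8"))` (`mov ax, ds`) is not.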

This is all coming from the Intel® 64 and IA-32 Architectures Software Developer's Manual, Combined Volumes 2A and 2B: Instruction Set Reference, A-Z. Be aware that the instructions are sometimes preceded by other prefixes and sometimes not, so you should probably use a library to do the instruction decoding rather than blindly searching for byte sequences.

Some of the above instructions should be relatively straightforward to turn into call my_patch or similar, but you're probably going to have trouble finding something that fits in two bytes and works in general. int XX (CD XX) might be a good candidate if you can set up an interrupt vector, but I'm not sure it's going to be faster than the method you're currently using. You will of course need to record which instruction was patched out and have the interrupt handler (or whatever you use) react differently depending on the return address that it receives.
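The bookkeeping for the int XX approach could look roughly like the Python sketch below. It overwrites one instruction with CD XX, pads the leftover bytes with NOPs, and records the original bytes under the return address the handler would see (the address just past the two-byte int). The function name `patch_with_int`, the `table` dictionary and the vector number used in the usage line are placeholders for illustration, and whether a given vector is usable depends entirely on your environment.

```python
INT_OPCODE = 0xCD  # int imm8
NOP = 0x90


def patch_with_int(code: bytearray, offset: int, length: int,
                   vector: int, table: dict, base: int = 0) -> None:
    """Replace the instruction at code[offset:offset+length] with `int vector`.

    `table` maps the handler's return address to the original instruction
    bytes, so the handler can tell which instruction it has to emulate.
    `base` is the (assumed) load address of `code`.
    """
    if length < 2:
        raise ValueError("need at least 2 bytes for CD XX")
    if not 0 <= vector <= 0xFF:
        raise ValueError("interrupt vector must fit in one byte")
    original = bytes(code[offset:offset + length])
    code[offset] = INT_OPCODE
    code[offset + 1] = vector
    # Pad the rest with NOPs so execution resumes cleanly after the handler returns.
    for k in range(offset + 2, offset + length):
        code[k] = NOP
    # The return address pushed by `int` points just past the 2-byte instruction.
    table[base + offset + 2] = original
```

Usage would be something like `patch_with_int(text, 0, 6, 0x7F, patched, base=0x8048000)`, after which the handler looks up `patched[return_address]` to decide what to emulate.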

You might be able to set up a trampoline if you can find room within -128..127 bytes of the patched instruction, and use JMP rel8 (EB cb) to jump to it. The trampoline is usually another JMP, this time with more room for the target address; its target handles the instruction emulation and then jumps back to the instruction following the patched-out %gs usage.
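Encoding the two jumps is mostly a matter of getting the displacement right: it is relative to the address of the next instruction, so 2 bytes past the patch site for JMP rel8 and 5 bytes past for JMP rel32 (E9 cd). A small Python sketch, with made-up function names and example addresses:

```python
import struct


def encode_jmp_rel8(patch_addr: int, target_addr: int) -> bytes:
    """Encode EB cb at patch_addr so that it jumps to target_addr."""
    rel = target_addr - (patch_addr + 2)  # relative to the end of the 2-byte jump
    if not -128 <= rel <= 127:
        raise ValueError("target out of rel8 range")
    return bytes([0xEB, rel & 0xFF])


def encode_jmp_rel32(jmp_addr: int, target_addr: int) -> bytes:
    """Encode E9 cd at jmp_addr, e.g. for the second jump inside the trampoline."""
    rel = target_addr - (jmp_addr + 5)  # relative to the end of the 5-byte jump
    # struct.error is raised if rel does not fit in a signed 32-bit value.
    return b"\xE9" + struct.pack("<i", rel)
```

For instance, `encode_jmp_rel8(0x1000, 0x0FF0)` yields `EB EE`, a jump 18 bytes backwards from the end of the patched instruction.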

Lastly, I'd recommend keeping the trap-and-emulate code running to catch any cases you might not have thought of (self-modifying or injected code, for instance). This way you can also log any unhandled cases and add them to your solution.

Right now we're intercepting page faults caused by accessing the non-existent %gs register, would the two-byte interrupt method be faster than intercepting the page fault? — Philip Wernersbach, Aug 02 '11 at 22:42
@Phillip: Since I don't know your execution environment I don't know if it is even possible to install an interrupt handler, and as I wrote it's probably not going to be much faster (you will need to time it yourself). I only mentioned using interrupts because it's a two byte instruction and allows a "far jump". I would have suggested using `int3` (`CC`) which causes a `SIGTRAP`, but that will interfere with debugging. — user786653, Aug 03 '11 at 08:33
