Improving performance of a simple homebrewed JIT for a loop on a register machine VM

Question

Right now I'm learning assembly right now and my project is to translate byte code from a fantasy architecture to real assembly written in byte code and executing with JIT.

In order to do this, I had to implement instructions from this other architecture. Some of them were simple, just like the common assembly instructions, there were two of them that needed more bytes to implement, like these two:

RX and RY - 32 bit registers memory - array of bytes ordered in little endian that contains the original byte code

mov RX, memory[RY] - reads the next 4 bytes (starting in RY) in memory, shift right those bytes and concatenates them in RX. mov memory[RX], RY - the inverse operation. Reads the value in RY, shift left the bytes to order them in little endian.

In C code, these instructions would be (considering R and mem are global):

 // mov RX, mem[RY]
 void movRxMemRy(unsigned char x, unsigned char y) {

   if (R[y]+3 > 128) endExecution = 1;

   else R[x] = mem[R[y]+3] << 24 | mem[R[y]+2] << 16 | mem[R[y]+1] << 8 | mem[R[y]];
}

// mov mem[RX], RY
void movMemRxRy(unsigned char x, unsigned char y) {
    if (R[x]+3 > 128) {
       endExecution = 1;
    } else {
       mem[R[x]] = (R[y]);  
       mem[R[x]+1] = (R[y]) >> 8;  
       mem[R[x]+2] = (R[y]) >> 16;  
       mem[R[x]+3] = (R[y]) >> 24;        
    }
    return;
}

These instructions were implemented as part of the interpreter, which should be around 5-10x slower (or more) than the assembly/jit implementation, but right now it takes 1/3 of the time (around 1,7~1,8s) to run these instructions. Our implementation of the instructions has to run the following instructions that the professor gave us on the original byte code:

mov R0, 0x006C
mov R1, 0x0001
mov R2, [R0]       # start of huge the loop. [R0] contains the loop counter
cmp R15, R2        # R15 = 0
je 0x0030          # ends the loop execution
mov R14, R2
add R13, R14
sub R2, R1         # decrements the loop counter by 1
mov [R0], R2       # saves the loop counter
jmp 0xFFC8         # returns to the start of the loop

Since the loop is responsible for more than 99% of the execution time, I decided to paste only this part here. The loop counter is the value contained in [R0] and it starts in 0x03885533 (decremented by on 1 each iteration). It quits the loop once the value reaches zero.

The most complex instructions are the ones responsible for most of the execution time, in addition to the add and sub instructions. I need to optimize them to be faster and, if possible, use the least amount of bytes, because I believe that there may be something wrong with them since it is running in 4,5s. The assembly/jit version is faster than the interpreted version and has to run in less than 1 second (the time limit for this project). My current implementation of them is:

r15: contains the array of 16 32-bit "registers" from the fantasy architecture used to store the final result rbx: contains the memory array that has the original byte code (and the counter value).

// Sub instruction is the same as the add instruction, but changing the opcode byte.

void add(unsigned char opcode, unsigned char x, unsigned char y) {

start = c;

// 0x09 - add rx, ry

// mov r14d, [r15+4*y]
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x77;
machine[c++] = 4*y;

// add [r15+4x], r14d
machine[c++] = 0x45;
machine[c++] = 0x01;
machine[c++] = 0x77;
machine[c++] = 4*x;

end = c;

for (k= 0; k < (88 - (end-start)); k++) {
    machine[c++] = 0x90;
}
end = c;

}

// mov RX, mem[RY]
void movRxMemRy(unsigned char opcode, unsigned char x, unsigned char y) {

// xor r14, r14
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xf6;

// xor r13, r13
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xed;

// xor r12, r12
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xe4;

// mov    r12d,DWORD PTR [r15+4*Y]
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x67;
machine[c++] = 0x4*y;

// mov    r13b,BYTE PTR [rbx+r12*1+0x3]    
machine[c++] = 0x46;
machine[c++] = 0x8a;
machine[c++] = 0x6c;
machine[c++] = 0x23;
machine[c++] = 0x03;

// shl    r13,0x18
machine[c++] = 0x49;
machine[c++] = 0xc1;
machine[c++] = 0xe5;
machine[c++] = 0x18;

// or     r14,r13
machine[c++] = 0x4d;
machine[c++] = 0x09;
machine[c++] = 0xee;

// xor r13, r13
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xed;

// mov    r13b,BYTE PTR [rbx+r12*1+0x2]    
machine[c++] = 0x46;
machine[c++] = 0x8a;
machine[c++] = 0x6c;
machine[c++] = 0x23;
machine[c++] = 0x02;

// shl    r13,0x10
machine[c++] = 0x49;
machine[c++] = 0xc1;
machine[c++] = 0xe5;
machine[c++] = 0x10;

// or     r14,r13
machine[c++] = 0x4d;
machine[c++] = 0x09;
machine[c++] = 0xee;

// xor r13, r13
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xed;

// mov    r13b,BYTE PTR [rbx+r12*1+0x1]    
machine[c++] = 0x46;
machine[c++] = 0x8a;
machine[c++] = 0x6c;
machine[c++] = 0x23;
machine[c++] = 0x01;

// shl    r13,0x18
machine[c++] = 0x49;
machine[c++] = 0xc1;
machine[c++] = 0xe5;
machine[c++] = 0x08;

// or     r14,r13
machine[c++] = 0x4d;
machine[c++] = 0x09;
machine[c++] = 0xee;

// xor r13, r13
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xed;

// mov    r13b,BYTE PTR [rbx+r12*1]    
machine[c++] = 0x46;
machine[c++] = 0x8a;
machine[c++] = 0x2c;
machine[c++] = 0x23;

// or     r14,r13
machine[c++] = 0x4d;
machine[c++] = 0x09;
machine[c++] = 0xee;

// mov    r13b,BYTE PTR [rbx+r12*1+0x3]    
machine[c++] = 0x45;
machine[c++] = 0x89;
machine[c++] = 0x77;
machine[c++] = x*4;


end = c;
}


void movMemRxRy(unsigned char opcode, unsigned char x, unsigned char y) {
start = c;

// xor r14, r14
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xf6;

// xor r13, r13
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xed;

// xor r12, r12
machine[c++] = 0x4d;
machine[c++] = 0x31;
machine[c++] = 0xe4;

// r12d,DWORD PTR [r15+0xc] (atual)
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x67;
machine[c++] = 0x4*x;

// mov r14d,DWORD PTR [r15+4*Y]
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x77;
machine[c++] = 0x4*y;

// mov DWORD PTR [rbx+r12*1], r14b
machine[c++] = 0x46;
machine[c++] = 0x88;
machine[c++] = 0x34;
machine[c++] = 0x23;

// mov r14d,DWORD PTR [r15+4*Y]
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x77;
machine[c++] = 0x4*y;

// shl r14d, 0x8
machine[c++] = 0x41;
machine[c++] = 0xc1;
machine[c++] = 0xee;
machine[c++] = 0x08;

// mov BYTE PTR [rbx+r12*1+0x1],r14b
machine[c++] = 0x46;
machine[c++] = 0x88;
machine[c++] = 0x74;
machine[c++] = 0x23;
machine[c++] = 0x01;

// mov r14d,DWORD PTR [r15+4*Y]
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x77;
machine[c++] = 0x4*y;


// shl r14d, 0x10
machine[c++] = 0x41;
machine[c++] = 0xc1;
machine[c++] = 0xee;
machine[c++] = 0x10;

// mov BYTE PTR [rbx+r12*1+0x2],r14b
machine[c++] = 0x46;
machine[c++] = 0x88;
machine[c++] = 0x74;
machine[c++] = 0x23;
machine[c++] = 0x02;

// mov r14d,DWORD PTR [r15+4*Y]
machine[c++] = 0x45;
machine[c++] = 0x8b;
machine[c++] = 0x77;
machine[c++] = 0x4*y;


// shl r14d, 0x18
machine[c++] = 0x41;
machine[c++] = 0xc1;
machine[c++] = 0xee;
machine[c++] = 0x18;

// mov BYTE PTR [rbx+r12*1+0x2],r14b
machine[c++] = 0x46;
machine[c++] = 0x88;
machine[c++] = 0x74;
machine[c++] = 0x23;
machine[c++] = 0x03;

end = c;

for (k= 0; k < (88 - (end-start)); k++) {
    machine[c++] = 0x90;
}
end = c;

}

Since for the first project we don't need the conditional which check if the Rn+3 > 128 , I decided to make it work first without having to implement the conditional on the assembly code.

Any ideas on how I could improve my code to run below 1s? Any help will be appreciated.

Obviously a huge improvement would be to use actual x86-64 registers instead of memory to implement your virtual registers. Store/reload latency is probably hurting. Also those single-byte NOPs are probably not helping the front-end, IDK what the point of that is, or where the `88 - (end-start)` came from; that's a huge amount of NOPs. — Peter Cordes, Oct 29 '21 at 01:37
Also, why would you need to save/reload the loop counter R0 in your virtual-machine asm? Is that VM-memory operation just to create a longer dep chain that reveals some performance effect after JITing? (Also note that a JIT compiler is normally an *optimizer*, not just translating instructions one at a time without optimizing between them. But yes, as toy projects there are just JIT translators like BF to x86 machine code. Related: [Are these the smallest possible x86 macros for these stack operations?](https://stackoverflow.com/q/62542955) for a stack-based VM -> x86) — Peter Cordes, Oct 29 '21 at 01:41
I don't understand what all the complexity is about in the x86-64 machine code for `movRxMemRy`. Why are you doing any byte load/store, and why are you doing any OR operations? If virtual registers are 32-bit, shouldn't `mov R2, [R0]` be doing a 32-bit load? But if you *do* want to do a zero-extending byte load, use `0F B6` `movzx reg, byte [mem]` (https://www.felixcloutier.com/x86/movzx). (Into the 32-bit reg, so hopefully you don't need any REX prefix.) — Peter Cordes, Oct 29 '21 at 01:47
What hardware are you running this on? I assume not Zen2 or Ice Lake? Otherwise their zero-latency store-forwarding (in cases where the addresses match) might be helping. — Peter Cordes, Oct 29 '21 at 01:48
@PeterCordes Each Rn is a virtual 32-bit register that I need to do some operations and print the result after the assembly code is executed, while the memory has 128 bytes. Is there a way to implement all the operations and store/write those values and access the memory only using the registers? The most I could do was to pass both arrays (registers and memory) as arguments to the jit function and store their address on the R15 and RBX. The biggest instruction has 88 bytes, so I had add padding to the other instructions in order to align and make the jumps easier to implement. — Victor, Oct 29 '21 at 02:04
@PeterCordes I was not very clear when writing the post. I'm gonna edit it. The assembly loop is the code we need to execute with our implementation of the instructions. It was made like this on purpose for us to implement our instructions correctly. I'm gonna check this link! — Victor, Oct 29 '21 at 02:09
That's because the memory is an array of bytes. When it performs a mov RX, [RY] operation, it needs to read the 4 bytes starting from position R[Y], shift right them and concatenate to make a 4 bytes value in RX. And the hardware is an old generation Intel that the professor uses it run the projects. — Victor, Oct 29 '21 at 02:15
Your VM is little-endian and x86 is little-endian (with support for unaligned loads/stores). You can therefore just do a dword load and not even need a `bswap` instruction. You should be implementing the VM semantics as directly as possible, not naively transliterating your portable C. (Which will compile to a single dword load or store with modern compilers, like GCC since 8 or something) — Peter Cordes, Oct 29 '21 at 02:25
Surprisingly, GCC and clang are having a hard time compiling the interpreter version to a 32-bit load or store. Clang manages it for 32-bit ARM, but not for x86-64 or even AArch64! https://godbolt.org/z/vE7oqP5q4. Those are just missed optimizations, though, in GCC and clang. You have no reason for making the same missed optimization because you're starting with a 32-bit load, not trying to coalesce adjacent byte accesses. Like in the C version I added which uses `htole32(R[y])` to make a little-endian copy of the store value, then uses memcpy to store it. (It optimized down to one store) — Peter Cordes, Oct 29 '21 at 02:45
@Victor It works the same on real hardware, just do a dword memory access and it'll access four adjacent bytes of memory. No need to emulate. — fuz, Oct 29 '21 at 07:49
Thanks @PeterCordes. I've changed to instruction to use a dword instead of shl/shr each byte and it really improved the performance! Now it's running in 1,2s. It helped a lot, since the heaviest instruction has only 18 bytes now. — Victor, Oct 29 '21 at 22:47
Thanks @fuz ! Just like I said on the previous comment, I changed it to use dword and it improved a lot. I still believe there's something else that I'm doing wrong, since it's running 3-4 times slower than it should be. I might be implementing the jit logic wrong. First I feed the byte array by translating the fantasy byte code (this happens really fast) and then run it using jit all at once. Is this how it should be done? — Victor, Oct 29 '21 at 23:02

score 2 · Answer 1 · edited Oct 30 '21 at 09:30

2

I was making some rookie mistakes by making a direct translation from C code to assembly by implementing the mov operations to concatenate byte by byte while I could've just treated them as a dword.

By doing this, the instructions became clean and faster. Additionally, the padding with NOPs that I had to add to align the instructions and make it easier for the jumps was being unnecessarily executed. So I added a jump to the next instruction at the end of each instruction. As an example, here's how the mov instructions are now implemented:

mov Rx, mem[Ry]:

mov r12d, dword ptr [r15+4y]
mov r12d, dword ptr [rbx+r12]
mov DWORD PTR [r15+4x],r12d
jmp nextInstruction

Mov mem[Rx], Ry:

mov r12d,DWORD PTR [r15+4*y]
mov r14d,DWORD PTR [r15+4*x]
mov DWORD PTR [rbx+2*x],r12d
jmp nextInstruction

By doing this, the code executed in 0.6 seconds.

Thanks to Peter Cordes and fuz who helped identify these problems.

edited Oct 30 '21 at 09:30

Peter Cordes

328,167
45
605
847

answered Oct 30 '21 at 04:33

Victor

91
1
6

1

Why do you need blocks to be fixed-width at all? Having lots of `jmp` instructions sucks, too, hurting code density and fetch/decode throughput, and cluttering up branch prediction. ([Slow jmp-instruction](https://stackoverflow.com/q/38811901)). If you did need to pad blocks, for 30 bytes or less of padding it's usually best to use 2 or 1 long NOPs. Each NOP can be up to 15 bytes. (Or less if you limit it to at most 3 prefix bytes because some CPUs have a problem with that.) [Long multi-byte NOPs: commonly understood macros or other notation](https://stackoverflow.com/q/25545470) – Peter Cordes Oct 30 '21 at 09:27
1

Also, you could still map 12 of your 16 registers directly to x86-64 registers, so VM instructions that used those wouldn't have to load/store from your `R[]` array. That leaves 2 free for scratch regs and one for the `mem` base address, and RSP as the stack pointer. If you could use static storage in a PIE executable for your `mem[]` array to allow `[disp32 + reg*4]` addressing, you could map them all, if you can let RSP be used for a VM register. (Or if you can change your VM to disallow one of the regs so there are only 15.) That would mean some encodings wouldn't need REX prefixes. – Peter Cordes Oct 30 '21 at 09:36

Improving performance of a simple homebrewed JIT for a loop on a register machine VM

1 Answers1