
The title may be a little unclear, so here is a clarification:

The problem:

a = b + c * d;

which in my implementation is resolved into these two "instructions":

mul(c, d, temp)
add(b, temp, a)

I am currently using temporary objects to store the intermediate values, which means the temporary value is stored in RAM and fetched from it again when it is needed. Neither of those round trips is really necessary, and both cost performance.
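To make that concrete, here is a simplified sketch of what currently happens (the Value type and the signatures are placeholders, the real code is more involved):

struct Value { double v; };

void mul(const Value& x, const Value& y, Value& out) { out.v = x.v * y.v; }
void add(const Value& x, const Value& y, Value& out) { out.v = x.v + y.v; }

void eval(Value& a, const Value& b, const Value& c, const Value& d)
{
    Value temp;        // temporary object - it lives in memory
    mul(c, d, temp);   // the result of c * d is written to RAM
    add(b, temp, a);   // and fetched right back just to add b
}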

I am implementing the VM in C++, so my question is: is there some portable way to avoid storing temporary values in main memory and keep them in actual CPU registers instead?

I've done some testing with the `register` keyword, but judging from the lack of any performance improvement, I'd say the compiler is ignoring it.

As a last resort, I am willing to go for platform-specific assembly, but I am pretty much in the dark on the subject, so if that is the only possible way, good info is welcome. I do realize the example I've given is a basic one; it is more than likely I will run into situations where a lot of temporary objects are needed, in which case there should be some way to determine how many registers to use and fall back to memory storage for the rest...

Perhaps there is some way to ask for register storage and, if the compiler "runs out of" registers, have the temporaries automatically pushed onto the stack? As far as I am familiar with assembly, you address specific registers by name, and I am unclear on how exactly the compiler handles potential register usage conflicts...

dtech
  • And are you sure that temporary value won't just be stored in a register? – Pawel Zubrycki Aug 09 '12 at 09:13
  • @PawelZubrycki - if it is compiled it will most likely be optimized. But in my case I have separate functions that are called in an arbitrary order the compiler cannot know about, and those instructions involve assignment. – dtech Aug 09 '12 at 09:16
  • Using `register` is not portable :-), because the number and types of available registers vary between systems. Most compilers ignore the keyword, because they optimize better without it. – Bo Persson Aug 09 '12 at 09:18
  • @BoPersson The C-level `register` keyword is perfectly portable (and equally useless, that's why it's so portable). –  Aug 09 '12 at 09:19
  • @ddriver So I take it you are writing an interpreter. To avoid the [XY problem](http://meta.stackexchange.com/q/66377), may we take a step back and get more detail? For instance, you seem to assume that spilling temporaries in RAM is *the thing* to optimize away. But where's whatever you are interpreting coming from? Is this a stack- or register-based VM? Are there several types, or just one? –  Aug 09 '12 at 09:25
  • @delnan - it is neither a stack- nor a register-based approach; I don't even know how to classify it. It is fairly simple and fairly high-level, but so far I am getting very good performance and would like to eliminate that one obvious overhead, since we all know how slow memory access is compared to register operations. – dtech Aug 09 '12 at 09:33
  • Well give us details on your VM then. The only thing I can suggest now is "build a JIT", and while that's a cool exercise and may take you a long way, it takes a lot of effort, limits portability, makes embedding harder, makes features like debugging and profiling much harder to implement, amounts to re-doing half of your code, and has its own set of performance pitfalls. –  Aug 09 '12 at 09:36
  • Writing a JIT compiler is way outside of my competence right now. It would be easier to make my language generate its C++ equivalent and compile that instead... – dtech Aug 09 '12 at 09:40
  • Is the following an accurate restatement of your problem? Your program is parsing expressions like “a = b + c * d;”. Some routine that is part of the parser will recognize “c * d” and perform the multiplication, obtaining some result “temp”. Later, in another routine call, a routine will recognize the “a + ...” and will perform the addition of “a” to “temp”. For efficiency, you would like “temp” to be stored in a register during this interval from one routine call to the next. – Eric Postpischil Aug 09 '12 at 09:57
  • @EricPostpischil - that is correct – dtech Aug 09 '12 at 10:25

4 Answers


Exactly as with inline, register is just a recommendation to the compiler. It may or may not follow it, just as it may or may not decide to store "normal" variables in registers.

The C++ standard says (7.1.1, paragraph 3):

A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ Note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated (see D.2). — end note ]
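For illustration, this is roughly all the keyword amounts to (the function and variable names here are made up for the example):

#include <cstddef>

void scale(int* data, std::size_t n, int factor)
{
    register int f = factor;   // only a hint; most compilers will ignore it
    for (std::size_t i = 0; i != n; ++i)
        data[i] *= f;
    // Taking the address of f would make the hint pointless, per the note above.
}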

Before doing such a low-level optimization, you should really run a good analysis to determine the bottlenecks of your system, and see if you actually need it or not.

Also, I could bet that if you are not a pro asm programmer, the compiler will optimize the code better than you (no offense to anyone, and I mean it in a general way).

SingerOfTheFall
  • It is a no-brainer: storing to and loading from memory is very slow, and when it is not really needed it is the source of significant overhead, many, many CPU cycles that could be saved. – dtech Aug 09 '12 at 09:20
  • @ddriver There is no such thing as a no-brainer when it comes to optimizing at this level. If you organize your data cleverly, you may hit the cache frequently with those stores (for instance, in a stack-based interpreter, the top of the stack should be in cache). And in most interpreters I am familiar with, there are far worse performance hogs, such as repeated boxing and unboxing, unpredictable branches (the dispatch loop), typechecking and dynamic dispatch. I don't know how much of that applies to your interpreter, but there are likely bigger fish to fry. –  Aug 09 '12 at 09:28
  • @delnan - yes, this is all true, but unlike branching and dynamic dispatch this may be optimized away, considering the temporaries are bound to the statements they are being used in, I can assume they will still be in the CPU cache, since they will still be "hot" by the time they are needed. But cache is still slower than keeping it on the register. – dtech Aug 09 '12 at 09:36
  • @ddriver "deprecated" is Standardese for "Normative for the current edition of the Standard, but not guaranteed to be part of the Standard in future revisions." So if you want your code to be future-proof, don't use it. – TemplateRex Aug 09 '12 at 09:38
  • @ddriver An L1 cache hit is *very fast*, apparently [in the same ballpark as a register access](http://stackoverflow.com/q/10274355/395760). Contrast this with a branch prediction failure, which easily eats dozens of cycles. Please try to find out how much time is spent in instruction dispatch vs. memory access for the temporaries. I bet you the former takes far longer. And dispatching *can* be optimized, for instance by more convoluted dispatching schemes. –  Aug 09 '12 at 09:52

Registers don't work the way you think they do, anyway. A name like R2 isn't really different from an address like 2. Sure, x86 assembly has fancier names such as ECX, but under the hood that is still just a register number.

And they're often not physical, either. Like virtual memory, register names are ephemeral. Modern processors may take a while to store a register value to RAM. They could wait for this to finish before recycling the register, but the faster solution is to just recycle the name and let the old (now unnamed) physical register hold the value until the write is finished. This means the number of register names can be lower than the number of physical registers. (Another benefit is that newer and more expensive CPUs can have more registers and still be ISA-compatible.)

That said, your problem is classically solved by FMA, fused multiply-add. Your source code should not translate to mul and add, but to mul_add(c, d, b, a). This allows the C++ compiler to emit an FMA instruction, entirely bypassing the need for a temporary.
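A sketch of what such a fused handler could look like, assuming the operands are plain doubles (the signature mirrors the mul_add above):

#include <cmath>

// a = b + c * d in one step, with no temporary object in between.
// std::fma(x, y, z) computes x * y + z with a single rounding, and with the
// appropriate target flags the compiler can lower it to one hardware FMA.
void mul_add(double c, double d, double b, double& a)
{
    a = std::fma(c, d, b);
}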

MSalters
  • Yes, since `a -+ b */ c` is a frequent form I actually have dedicated "instructions" for those cases, where the compiler optimizes away the temporary objects. However, that was just a simple example to illustrate my problem; there are many more complex cases that cannot be predicted, where the number of temporaries needed is greater. – dtech Aug 09 '12 at 10:24
  • @ddriver: You want those pre-built instructions anyway. I bet your example of `mul(); add();` is actually calling them through pointers (since your script compiler can only store those). The overhead of those two indirect calls dwarfs the temporary in L1. – MSalters Aug 09 '12 at 10:34

C does not provide any way for you to control whether values are kept in registers between function calls.

You are trying to optimize the wrong thing. The operations required to parse strings and perform emulation will involve many, many low-level processor operations, such as loading bytes, comparing bytes and looking them up in tables, branching based on results of comparisons, pushing routine arguments onto the stack, returning from routines, looking identifiers up in symbol tables, and so on. The simple load of a value from memory is a tiny part of this process.

Even assuming you have separated the parsing and the emulation, so that parsing has produced code in a virtual machine language, the operations required to execute that code still involve many operations, such as loading the bytes for an instruction from memory, decoding those bytes, branching to code to execute the decoded instruction, and so on.

The best you might hope for, when writing in C or C++ or any high-level language, is to write all of the emulation code in one compilation unit (one source file plus the headers it includes), possibly even inside one routine, so that the compiler's optimizer can see all of it and optimize all of it. In that case, if you have a main loop that is reading, decoding, and executing instructions, the compiler might see that values in temporary objects are retained and reused from iteration to iteration, and so it might decide to keep those temporary values in registers.
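As a rough illustration, with a trivial invented instruction set, the point is that the intermediate value is a plain local inside that one routine:

enum Op { OP_LOADCONST, OP_ADD, OP_MUL, OP_HALT };
struct Instr { Op op; double operand; };

// Everything the optimizer needs to see is in one routine; the running value
// is a plain local, which the compiler is free to keep in a register.
double run(const Instr* code)
{
    double acc = 0.0;                          // the "temporary"
    for (const Instr* ip = code; ; ++ip) {
        switch (ip->op) {
            case OP_LOADCONST: acc = ip->operand;  break;
            case OP_ADD:       acc += ip->operand; break;
            case OP_MUL:       acc *= ip->operand; break;
            case OP_HALT:      return acc;
        }
    }
}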

However, emulation of a virtual machine is a large task, so your code is likely to have many, many objects. It will have at least one object (likely an array element) for each register in the emulated machine, plus objects for other aspects of machine state, plus objects used to decode instructions and to dispatch to emulation code. The very simplest possible virtual machine emulator, suitable only for classroom exercises, might have few enough objects that most of them fit in processor registers. But any slightly realistic virtual machine emulator will have so many objects that few of them can be retained in processor registers. In this case, you are most likely better off leaving optimization to the optimizer and not trying to do it yourself.

Eric Postpischil

If you turn on optimization when you compile, the compiler should use registers automatically for local variables, if there are enough registers available. Check the generated machine code to find out whether it does so in your case.

Klas Lindbäck
  • "for local variables" -- exactly; in his case there seem to be multiple functions involved, hence registers are probably spilled on the stack more or less frequently. –  Aug 09 '12 at 09:30
  • In my case those are NOT local variables; they exist in different functions that are called in an arbitrary order, so there is no way for the compiler to make those optimizations for me. – dtech Aug 09 '12 at 09:30
  • You can try inlining the functions. – Klas Lindbäck Aug 09 '12 at 10:24
  • @ddriver - I think you *seriously* underestimate [what a compiler can do](http://stackoverflow.com/a/11639305/597607). – Bo Persson Aug 09 '12 at 11:32