3

The two following versions of the same function (which basically tries to recover a password by brute force) do not give the same performance:

Version 1:

private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;

private static char[] recoverPassword()
{
   char word[];
   int refi, i, indexes[];

   for (int length = 1; length <= MAX_LENGTH; length++)
   {
      refi = length - 1;
      word = new char[length];
      indexes = new int[length];
      indexes[length - 1] = 1;

      while(true)
      {
         i = length - 1;   // the only line that differs from version 2
         while ((++indexes[i]) == N_CHARS)
         {
            word[i] = CHARS[indexes[i] = 0];
            if (--i < 0)
               break;
         }

         if (i < 0)
            break;

         word[i] = CHARS[indexes[i]];

         if (isValid(word))
            return word;
      }
   }
   return null;
}

Version 2:

private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;

private static char[] recoverPassword()
{
   char word[];
   int refi, i, indexes[];

   for (int length = 1; length <= MAX_LENGTH; length++)
   {
      refi = length - 1;
      word = new char[length];
      indexes = new int[length];
      indexes[length - 1] = 1;

      while(true)
      {
         i = refi;          // the only line that differs from version 1
         while ((++indexes[i]) == N_CHARS)
         {
            word[i] = CHARS[indexes[i] = 0];
            if (--i < 0)
               break;
         }

         if (i < 0)
            break;

         word[i] = CHARS[indexes[i]];

         if (isValid(word))
            return word;
      }
   }
   return null;
}

I would expect version 2 to be faster, as the only difference is that it does:

i = refi;

...as compared to version 1, which does:

i = length - 1;

However, it's the opposite: version 1 is faster by over 3%! Does someone know why? Is that due to some optimization done by the compiler?

Thank you all for your answers so far. Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but rather to understand, from a compiler / CPU / architecture perspective, what could explain such a performance difference. Your answers have been very helpful, thanks again!


Kei Man
  • 43
  • 4
  • 7
    You might want to make sure your benchmarking is correct. Benchmarking java is not always easy, and 3% is not a lot. – Jon Kiparsky Aug 15 '13 at 06:41

4 Answers

5

It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the machine code generated, and even then the CPU can do plenty of tricks to optimise it further, e.g. it turns the x86 code into RISC-style instructions to actually execute.

A computation takes as little as one cycle, and the CPU can perform up to three of them at once. An access to the L1 cache takes 4 cycles; for L2, L3 and main memory it takes 11, 40-75 and 200 cycles respectively.

Storing a value to avoid a simple calculation is actually slower in many cases. BTW, division and modulus are quite expensive, and caching such a result can be worth it when micro-tuning your code.
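For example, a minimal sketch of that kind of caching (the method, its purpose and the variable names are purely illustrative, not taken from the question):

// Illustrative only: map a flat counter onto word characters.
// The quotient of the division is computed once and reused to get the
// remainder, instead of performing a separate modulus operation.
private static void fillWord(char[] word, int flatIndex)
{
   for (int pos = word.length - 1; pos >= 0; pos--)
   {
      int quotient = flatIndex / N_CHARS;                  // one division...
      word[pos] = CHARS[flatIndex - quotient * N_CHARS];   // ...reused as the remainder
      flatIndex = quotient;
   }
}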

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
1

The correct answer should be retrievable with a decompiler (I mean a .class -> .java converter), but my guess is that the compiler might have decided to get rid of refi altogether and to store length - 1 in an auxiliary register. I'm more of a C++ guy, but I would start by trying:

const int refi = length - 1;

inside the for loop. Also you should probably use

indexes[ refi ] = 1;
  • 1
    The equivalent in Java is `final int refi = length - 1`. – Jason C Aug 15 '13 at 06:52
  • No it wouldn't; then again @Parakram never suggested he look at the class file, he suggested a decompiler for class files to see what the compiler has optimized from the original source file. – Jurgen Camilleri Aug 15 '13 at 06:56
  • @JurgenCamilleri A decompiler could give some, but limited, information. There is byte code that does not have an equivalent representation in Java, e.g. a decompiler could not show you if a `switch` block was compiled to a jump table, vs. a series of isub+ifeq vs. an offset*value, whatever. – Jason C Aug 15 '13 at 07:04
1

Comparing running times of code does not give exact or guaranteed results

First of all, comparing performance like this is not reliable; a running-time analysis is needed here. Both codes have the same loop structure, so their asymptotic running times are the same. You may still get different running times when you run them, but those differences mostly come from cache hits, I/O times, and thread & process scheduling. There is no guarantee that code always completes in an exact time.

However, there is still a difference in your code; to understand it you should look into your CPU architecture. I can explain it basically in terms of the x86 architecture.

What happens behind the scenes?

i = refi;

The CPU takes refi and i into its registers from RAM; that is 2 accesses to RAM if the values are not in the cache, and the new value of i will be written back to RAM. How long this takes also varies with thread & process scheduling. Furthermore, if the values have been paged out to virtual memory it will take even longer.

i = length -1;

The CPU also accesses i and length from RAM or cache, so the number of accesses is the same. In addition, there is a subtraction here, which means extra CPU cycles. That is why you would expect this one to take longer to complete. The expectation is reasonable, but the issues I mentioned above explain why the measured times can come out the other way.
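For a rough idea of what javac emits for those two statements, here is an illustrative bytecode sketch (local-variable slot numbers replaced by names for readability; the real output can be checked with javap, as mentioned below):

// i = refi;
//    iload  refi       // load refi
//    istore i          // store into i
//
// i = length - 1;
//    iload  length     // load length
//    iconst_1          // push the constant 1
//    isub              // subtract
//    istore i          // store into i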

Summary

As I explained, this is not the way to compare performance. I think there is no real difference between these codes. There are lots of optimizations inside the CPU and also in the compiler. You can see the optimized code if you decompile the .class files.
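For example, you can inspect the bytecode directly (the class name here is just a placeholder):

javap -c -p RecoverPassword        # -c prints the bytecode of each method, -p includes private ones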

My advice is that it is better to minimize the Big-O running time: finding a better algorithm is the best way to optimize code. In case you still have bottlenecks in your code, you may try micro-benchmarking.

See also

Analysis of algorithms

Big O notation

Microprocessor

Compiler optimization

CPU Scheduling

erencan
  • 3,725
  • 5
  • 32
  • 50
0

To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.

Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.

So:

  • You should try to check the generated assembly code.
  • If you really care about the performance, create an appropriate micro-benchmark, for example with JMH (see the sketch below).
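A minimal sketch of such a benchmark with JMH, assuming the JMH harness is on the classpath and that the two recoverPassword variants have been made accessible in hypothetical RecoverV1 / RecoverV2 classes:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class RecoverPasswordBenchmark
{
   @Benchmark
   public char[] version1()
   {
      return RecoverV1.recoverPassword();   // hypothetical class holding version 1
   }

   @Benchmark
   public char[] version2()
   {
      return RecoverV2.recoverPassword();   // hypothetical class holding version 2
   }
}

Returning the result from each benchmark method lets JMH consume it, which keeps the JIT from eliminating the work as dead code.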
assylias
  • 321,522
  • 82
  • 660
  • 783