How to make this code faster (learning best practices)?

Question

I have this little loop here, and I was wondering if I do some big mistake, perf wise.

For example, is there a way to rewrite parts of it differently, to make vectorization possible (assuming GCC4.8.1 and all vecotrization friendly flags enabled)?

Is this the best way to pass a list a number (const float name_of_var[])?

The idea of the code is to take a vector (in the mathematical sense, not necesserly a std::vector) of (unsorted numbers) y and two bound values (ox[0]<=ox[1]) and to store in a vector of integers rdx the index i of the entry of y satisfying ox[0]<=y[i]<=ox[1].

rdx can contain m elements and y has capacity n and n>m. If there are more than m values of y[i] satisfying ox[0]<=y[i]<=ox[1] then the code should return the first m

Thanks in advance,

void foo(const int n,const int m,const float y[],const float ox[],int rdx[]){
    int d0,j=0,i=0;
    for(;;){
        i++;
        d0=((y[i]>=ox[0])+(y[i]<=ox[1]))/2;
        if(d0==1){
            rdx[j]=i;
            j++;
        }
        if(j==m)    break;
        if(i==n-1)  break;
    }
}

It would help if you gave us an idea what the code should do. I may be a bit dense, it may even be a well known algorithm, but I don't click. — Peter - Reinstate Monica, Apr 02 '14 at 20:09
@PeterSchneider; sure thanks for the suggestion: I'll edit in now! — user189035, Apr 02 '14 at 20:10
As an aside, your coding style is not very idiomatic. Most people would write `for(int i=0; i — Peter - Reinstate Monica, Apr 02 '14 at 20:13
@PeterSchneider: thanks! I've picked that stuff up alone, so I will gladely follow your advices! If it's okay with you, I will not change that code since sqome people have already used it (it would create confusion) — user189035, Apr 02 '14 at 20:15
Here is something to learn - the compiler can optimize this kind of thing. Hand-written optimizations here may have no observable impact on the speed of your program. You should use a profiler to measure changes. — Neil Kirk, Apr 02 '14 at 20:15
I believe you! But I was wondering about a) the possibility of writing it in a way that would help the compiler vectorize (I don't think it's possible, but I don't know much) and b) I was wondering whether it's a good call to pass vectors (in the mathematical sense) as float name_of_var [] — user189035, Apr 02 '14 at 20:17
And then, more to the point: Now I understand what you are trying to do when you divide the sum of two boolean values by 2 ;-). It probably even works, but is non-idiomatic as well. Why not express what you want to do clearly in the language: `bool isInInterval = y[i] >= ox[0] && y[i] <= ox[1];`? Naming variables properly is important. — Peter - Reinstate Monica, Apr 02 '14 at 20:23
@NeilKirk That is not neccessarily true. Take a look at http://stackoverflow.com/questions/15665433/c-mysteriously-huge-speedup-from-keeping-one-operand-in-a-register/15665995#15665995 You can do quite a bit in C++ to boost performance without going to assembly that the compiler might not optimzie for you. You don't need a profiler to measure performance if you have a decent clock to measure start and stop time with. — UpAndAdam, Apr 02 '14 at 20:29
In terms of style, it's much easier to read if you put in spaces. `if(i==n-1)` becomes `if(i == n - 1)` — Almo, Apr 02 '14 at 21:05
@UpAndAdam How do you use a clock to find out the bottleneck of a non-trivial program with thousands of functions?? — Neil Kirk, Apr 02 '14 at 21:07
@NeilKirk You use if-def's put clock starts and stops at the beginning of your functions and generate a histogram of its execution time... But again in these cases we weren't talking about a non-trivial program or multiple functions. We were talking about a simple operation and maximizing throughput in it. This merely detects that there IS a bottleneck not exactly where it is, that's left to you just like with most profilers : your call to X sucked, or you spent a lot of time here. It doesn't tell you why you spent time there. That all said, premature optimization is also not a great thing. — UpAndAdam, Apr 02 '14 at 21:35
I wouldn't put it on hold. It's a specific question, and it has exactly one answer (the fastest code which does the job), which is fairly short.-- That said, I observed a few surprising details last night: **adding and dividing is much faster than logical AND.** I think modern processors are very fast with arithmetics but hurt by jumps (and the operator&& contains an implicit jump due to the shortcut logic). Then, shifting right is faster than dividing by 2 (the surprise there is that gcc doesn't do that with -O3). Even a little faster then is to just test against 2. It's like coding in 1978. — Peter - Reinstate Monica, Apr 03 '14 at 06:49
The fastest check whether the number is in the interval actually is testing the two conditions for equality. The number is only within the interval if both conditions hold; if outside the interval, one condition is always false while the other is true (never can both be false at the same time). Equality is thus sufficient, and computationally fastest (by a narrow margin -- eliminating the && was the big step, about 20% execution time). — Peter - Reinstate Monica, Apr 03 '14 at 06:57
Another remark to the OP is that the caller will not know how many indices the function found and stored in rdx. One option is to let the function return the numbers of assignments to rdx, i.e. j; another is to initialize rdx with all -1, i.e. a value which cannot be an index, and let the caller find the last element by checking whether the next one is -1, similar to strlen() with 0. — Peter - Reinstate Monica, Apr 03 '14 at 07:03
I have asked a new question which is based on this problem. I received valuable input for technical optimization as well as an idea about how to parallelize (with MS's parallel patterns lib). My new question is here, you may be interested: http://stackoverflow.com/questions/22832888/optimizing-a-comparison-over-array-elements-with-two-conditions-c-abstraction — Peter - Reinstate Monica, Apr 03 '14 at 11:27

Taiki · Accepted Answer · 2014-04-02T20:34:01.663

1

d0=((y[i]>=ox[0])+(y[i]<=ox[1]))/2;
if(d0==1)

I believe the use of an intermediary variable is useless, and take a few more cycles

This is the most optimized version I could think of, but it's totally unreadable...

void foo(int n, int m, float y[],const float ox[],int rdx[])
{
    for(int i = 0; i < n && m != 0; i++)
    {
        if(*y >= *ox && *y <= ox[1])
        {
            *rdx=i;
            rdx++;
            m--;
        }
        y++;
    }
}

I think the following version with a decent optimisation flag should do the job

void foo(int n, int m,const float y[],const float ox[],int rdx[])
{
    for(int j = 0, i = 0; j < m && i < n; i++) //Reorder to put the condition with the highest probability to fail first
    {
        if(y[i] >= ox[0] && y[i] <= ox[1])
        {
            rdx[j++] = i;
        }
    }
}

edited Apr 02 '14 at 20:34

answered Apr 02 '14 at 20:07

Taiki

629
5
13

1

it should get optimized out and not cost you anything in performance but gain in readability. – old_timer Apr 02 '14 at 20:09
Yep, but If we focus on readability, `for(;;)` can be criticized. Also, if this code is compiled with -O3, I doubt any change can be done to significantly improve performances on such short piece of code. – Taiki Apr 02 '14 at 20:12
Further, the if condition can be reduced to `if(y[i]>=oh[0] && y[i]<=ox[1]){` – alvits Apr 02 '14 at 20:15
yep, just catched it :) thanks – Taiki Apr 02 '14 at 20:15
1

Great. Now the poster will get an A+ in his homework. :) – alvits Apr 02 '14 at 20:16
@alvits; this is no homework :) you can check my profile: I'm a phd learning c++ – user189035 Apr 02 '14 at 20:18
You've removed the declaration of `j`, but you still use it. – Rob Kennedy Apr 02 '14 at 20:20
Yep, missed this one :) – Taiki Apr 02 '14 at 20:22
1

I think this is the best we can do, the compiler will have to do its job now – Taiki Apr 02 '14 at 20:23
Thanks! I will take this one as the good answer because I feel it answers the two questions I was wondering when writting this post (No, the if can't be vectorized and Yes, passing many numbers as float name_of_var [] is not a bad way to do things perf wise). – user189035 Apr 02 '14 at 20:30

score 1 · Answer 2 · answered Apr 02 '14 at 20:13

1

Just to make sure I'm correct: you're trying to find the first m+1 (if it's actually m, do j == m-1) values that are in the range of [ ox[0], ox[1] ]?

If so, wouldn't it be better to do:

for (int i=0, j=0;;++i) {
    if (y[i] < ox[0]) continue;
    if (y[i] > ox[1]) continue;
    rdx[j] = i;
    j++;
    if (j == m || i == n-1) break;
}

If y[i] is indeed in the range you must perform both comparisons as we both do.
If y[i] is under ox[0], no need to perform the second comparison.
I avoid the use of division.

answered Apr 02 '14 at 20:13

Arie

113
4

Division by two will be translated into a bit shift of one bit to the right on platforms using base-2 representation of integers. – Daniel Kamil Kozar Apr 02 '14 at 20:16
Is it a typo the (int i=0, j=0;;++i) (and you meant (int i=0,j=0;++i)?) – user189035 Apr 02 '14 at 20:21
It isn't a typo. There's really no condition in there. The break is within the block inside. – alvits Apr 02 '14 at 20:24
Well, usually the loop abort condition is part of the for(...)! – Peter - Reinstate Monica Apr 02 '14 at 20:25
One last question: you used < and > (lines 1 and 2), I'm not familiar with 'continue' but does it keep the index of the y[i] satisfying ox[0]<=y[i]<=ox[1]? – user189035 Apr 02 '14 at 20:26
I also don't like the two continues. The && operator guarantees that the second condition is not evaluated if the first one is wrong already (that's sometimes crucial when checking a pointer against 0 before checking their contents). Therefore an if(cond1 && cond2) is at least as efficient as two successive continues. – Peter - Reinstate Monica Apr 02 '14 at 20:27
The continue will skip the rest of the lines and loop back to the start of the loop. What Arie did was a shortcut. To go to the next iteration when y[i] is outside of the the range. – alvits Apr 02 '14 at 20:30
I totally agree @PeterSchneider. The && in c/c++ uses shortcut thus the second is not evaluated if the first is already `false. – alvits Apr 02 '14 at 20:31
@ Daniel Kamil Kozar: gcc 4.8.2 with -O3 under cygwin didn't substitute a bit shift for the division. Bit shift is actually faster. Back to seventies, I guess. – Peter - Reinstate Monica Apr 03 '14 at 07:05
When I experimented I found that conditions and the implicit jumps they create (including my suggestion of &&) are performance killers. Jumps were always somewhat costly, but maybe more so today because pipelining in the processor makes uninterrupted execution flow _really fast_ (OPS/cycle > 1). If it is interrupted by jumps the pipeline (and caches?) are invalidated and need to be re-filled which voids the benefits of modern CPU architecture. It seems therefore essential to avoid conditional jumps at all costs in performance critical code and replace it by computation (as the OP did). – Peter - Reinstate Monica Apr 03 '14 at 07:11

Peter - Reinstate Monica · Answer 3 · 2014-04-02T21:34:27.173

A. Yes, passing the float array as float[] is not only efficient, it is the only way (and is identical to a float * argument).

A1. But in C++ you can use better types without performance loss. Accessing a vector or array (the standard library container) should not be slower than accessing a plain C style array. I would strongly advise you to use those. In modern C++ there is also the possibility to use iterators and functors; I am no expert there but if you can express the independence of operations on different elements by being more abstract you may give the compiler the chance to generate code that is more suitable for vectorization.

B. You should replace the division by a logical AND, operator&&. The first advantage is that the second condition is not evaluated at all if the first one is false -- this could be your most important performance gain here. The second advantage is expressiveness and thus readability.

C. The intermediate variable d0 will probably disappear when you compile with -O3, but it's unnecessary nonetheless.

The rest is ok performancewise. Idiomatically there is room for improvement as has been shown already.

D. I am not sure about a chance for vectorization with the code as presented here. The compiler will probably do some loop unrolling at -O3; try to let it emit SSE code (cf. http://gcc.gnu.org/onlinedocs/, specifically http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options). Who knows.

Oh, I just realized that your original code passes the constant interval boundaries as an array with 2 elements, ox[]. Since array access is an unnecessary indirection and as such may carry an overhead, using two normal float parameters would be preferred here. Keep them const like your array. You could also name them nicely.

How to make this code faster (learning best practices)?

3 Answers3

Linked