
I want a function that can take a series and a set of bins, and basically round up to the nearest bin. For example:

my_series = [ 1, 1.5, 2, 2.3,  2.6,  3]
def my_function(my_series, bins):
    ...

my_function(my_series, bins=[1,2,3])
> [1,2,2,3,3,3]

This seems very close to what NumPy's digitize is intended to do, but it produces the wrong values (asterisks mark the wrong values):

np.digitize(my_series, bins= [1,2,3], right=False)
> [1, 1*, 2, 2*, 2*, 3]

The reason why it's wrong is clear from the documentation:

Each index i returned is such that bins[i-1] <= x < bins[i] if bins is monotonically increasing, or bins[i-1] > x >= bins[i] if bins is monotonically decreasing. If values in x are beyond the bounds of bins, 0 or len(bins) is returned as appropriate. If right is True, then the right bin is closed so that the index i is such that bins[i-1] < x <= bins[i] or bins[i-1] >= x > bins[i] if bins is monotonically increasing or decreasing, respectively.

I can get closer to what I want if I pass the bins in decreasing order and set "right" to True...

np.digitize(my_series, bins= [3,2,1], right=True)
> [3, 2, 2, 1, 1, 1]

but then I'd have to find a way of methodically swapping the lowest index (1) with the highest (3). That's simple with just 3 bins, but it will get hairier as the number of bins grows. There must be a more elegant way of doing all this.

Christian Ternus
Afflatus
  • How about `np.digitize(a,bins,right=True)+1` with the bins as in the original order? – Divakar Sep 08 '16 at 04:40
  • In some cases the bins may not increment by 1, so they could be something like [0,4,8,12,...]. Ideally the answer would also extend to variable/non-regular steps between intervals (like [0,2,4,7,11,16]), but that's less important. – Afflatus Sep 08 '16 at 04:47
  • I mean that's to get the indices; to get the corresponding bin values, do something like: `np.take(bins,np.digitize(a,bins,right=True))`. That should take care of irregular bin spacings. Won't that work? – Divakar Sep 08 '16 at 04:51
  • Yes, that works and is pretty straightforward. Good idea. – Afflatus Sep 08 '16 at 12:34
  • If you post it as an answer, I will accept it. Based on my tests with timeit, it scales much better: the times are the same for small series, but for my use case (which isn't even all that big), your solution takes 1/20th of the time his does. – Afflatus Sep 08 '16 at 17:58

3 Answers


We can use np.digitize with right=True to get the bin indices, then use np.take to pull the corresponding elements out of bins:

np.take(bins,np.digitize(a,bins,right=True))
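
A minimal sketch of that one-liner on the question's data, assuming `a` is the input series and `bins` is sorted in increasing order. One caveat: a value above the last bin makes np.digitize return len(bins), which np.take rejects by default. The helper name `round_up_to_bin` and the NaN fallback for such values are illustrative assumptions, not part of the original suggestion.

```python
import numpy as np

def round_up_to_bin(a, bins):
    """Map each value to the smallest bin edge >= value (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    bins = np.asarray(bins, dtype=float)
    # right=True closes the right edge: bins[i-1] < x <= bins[i]
    idx = np.digitize(a, bins, right=True)
    # Assumption: values above bins[-1] (idx == len(bins)) map to NaN
    # so the lookup stays in range instead of raising IndexError.
    padded = np.append(bins, np.nan)
    return np.take(padded, idx)

my_series = [1, 1.5, 2, 2.3, 2.6, 3]
print(round_up_to_bin(my_series, [1, 2, 3]).tolist())
# [1.0, 2.0, 2.0, 3.0, 3.0, 3.0]

# Irregular spacing works the same way.
print(round_up_to_bin([0, 3, 9, 11], [0, 2, 4, 7, 11, 16]).tolist())
# [0.0, 4.0, 11.0, 11.0]
```

If out-of-range values cannot occur in your data, the padding step can be dropped and the plain one-liner is enough.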
Divakar
  • Selecting this as the accepted answer because it performs better on bigger series. As mentioned in the comments on the question, for my particular use case it takes 1/20th of the time of Christian's answer. For shorter series, the times are in the same ballpark. – Afflatus Sep 08 '16 at 18:28

I believe np.searchsorted will do what you want:

Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.

In [1]: my_series = [1, 1.5, 2, 2.3, 2.6, 3]

In [2]: bins = [1,2,3]

In [3]: import numpy as np

In [4]: [bins[k] for k in np.searchsorted(bins, my_series)]
Out[4]: [1, 2, 2, 3, 3, 3]

(As of numpy 1.10.0, digitize is implemented in terms of searchsorted.)
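
The list comprehension above can also be written as a single array lookup, which avoids the Python-level loop. This is a sketch, assuming bins is sorted in increasing order and no value exceeds bins[-1]; `side="left"` is searchsorted's default and gives the same indices as digitize with right=True for increasing bins.

```python
import numpy as np

my_series = np.array([1, 1.5, 2, 2.3, 2.6, 3])
bins = np.array([1, 2, 3])

# For each x, the first index i such that bins[i] >= x.
idx = np.searchsorted(bins, my_series, side="left")
rounded = bins[idx]
print(rounded.tolist())  # [1, 2, 2, 3, 3, 3]

# Same indices as digitize with the right edge closed.
assert np.array_equal(idx, np.digitize(my_series, bins, right=True))
```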

Christian Ternus

Another way would be:

In [25]: def find_nearest(array,value):
    ...:     idx = (np.abs(array-np.ceil(value))).argmin()
    ...:     return array[idx]
    ...: 

In [26]: my_series = np.array([ 1, 1.5, 2, 2.3,  2.6,  3])

In [27]: bins = [1, 2, 3]

In [28]: [find_nearest(bins, x) for x in my_series]
Out[28]: [1, 2, 2, 3, 3, 3]
Nehal J Wani
  • Although it works for the example case, it doesn't work consistently, maybe because of some rounding issues with either np.ceil() or argmin(). See example: bins = [100,200,300,400,500,600,700,800,900]; my_series = [123,157,533,644,222,343]; [find_nearest(bins, x) for x in my_series] yields [100*, 200, 600, 600*, 200*, 300*] – Afflatus Sep 08 '16 at 12:28
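
The mismatch is probably not a rounding problem but a consequence of what find_nearest computes: np.ceil rounds up to the next integer, not the next bin, and argmin then picks the closest bin on either side. The sketch below contrasts that with a searchsorted lookup for a single value; the helper names `nearest_bin` and `ceiling_bin` are illustrative and not from the answer.

```python
import numpy as np

bins = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900])

def nearest_bin(bins, value):
    # Same computation as find_nearest: the bin closest to ceil(value),
    # whether that bin lies below or above the value.
    return bins[np.abs(bins - np.ceil(value)).argmin()]

def ceiling_bin(bins, value):
    # Smallest bin >= value. Assumes bins is sorted in increasing order
    # and value <= bins[-1].
    return bins[np.searchsorted(bins, value, side="left")]

print(int(nearest_bin(bins, 644)))  # 600: 644 is closer to 600 than to 700
print(int(ceiling_bin(bins, 644)))  # 700
```

The original example worked only because its bins were consecutive integers, so rounding up to an integer happened to land exactly on a bin.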