4

Let x be a vector of numeric, non-negative data (mostly < 10) and qx <- quantile(x, probs = pq), and where length(pq) is typically > length(x) * (3/4). I am in need of a vector of indices of qx, call it q_i, where x[i] falls in the quantile qx[q_i[i]].

The catch, as the title indicates, is that there may be non-unique values present in qx, e.g. multiple 0-valued quantiles if x is zero-inflated, and potentially other duplicate values. I would like to handle these cases by either (a) recycling the sequence of indices of these equivalent quantiles, or (b) randomly assigning the indices of equivalent quantiles. I think I would prefer option (a), but a solution for either would be useful.

Here is an edit to provide the rules for determining q_i[i] for a particular x[i]: Consider that qx has one or more sequences of duplicate values, i.e. for some j there is (are) sequence(s) qx[j:n] where qx[j] == qx[j + 1] == ... == qx[j + n] < qx[j + n + 1]. Let k = c(j, j + 1,..., j + n). Then q_i[i] <- k[r] where qx[j] <= x[i] <= qx[j + n + 1] if j == 1 or qx[j] < x[i] <= qx[j + n + 1] if j > 1, and where r <- m %% (n + 1) such that x[i] is the m-th occurrence in x where the inequality has been satisfied.

NOTE: based on this rule, I realized I omitted a 4 in my original q_i - this has been changed.

NOTE: @hodgenovice brought up a good point regarding special cases where data values that are strictly smaller than two quantiles may be grouped into the "bin" between two such quantiles. I am not particularly concerned with the special case because, if for example there were no duplicate quantiles but we had the same quantile values, those special cases would correctly be binned together.

I'm thinking there is an efficient way to do this - I have essentially done this using a for loop but I am looking for a vectorized approach.

I started trying to work with cut() which of course doesn't allow non-unique breaks. I found this question here which kind of helped, in that I discovered the .bincode() function, which does allow non-unique breaks. However, it has no rule for "distributing" the indices - it would only use the index of the first of each duplicated quantile value.

Some example code for this problem:

x <- c(5.8,  0.0, 16.1,  5.8,  3.5, 13.8,  6.9,  5.8, 11.5,  9.2, 11.5,
       3.5,  0.0,  8.1,  0.0,  4.6,  5.8,  3.5,  0.0, 10.3,  0.0,  0.0,
       3.5, 6.9, 3.5)
pq <- seq(0, 1, length.out = 20)
qx <- quantile(x, pq)

# quantiles for reference, rounded for readability
round(as.numeric(qx), 2)
[1]  0.00  0.00  0.00  0.00  0.18  3.50  3.50  3.50  3.62  5.04  5.80 5.80  5.97
[14] 6.90  7.72  9.14 10.55 11.50 13.19 16.10

q_i <- .bincode(x, qx, include.lowest = TRUE)
q_i
[1] 10  1 19 10  5 19 13 10 17 16 17  5  1 15  1  9 10  5  1 16  1  1  5 13 5

Here are the results I would be looking for, if .bincode() was magic and I could talk it into doing what I need:

Under scenario (a) above: (I edited this too, as I was originally missing a value of 4)

q_i
[1] 10 1 19 11 5 19 13 10 17 16 17 6 2 15 3 9 11 7 4 16 1 2 5 13 6

Under scenario (b), it could, with low probability, look the same as directly above. Or something like:

q_i
[1] 10 1 19 10 6 19 13 11 17 16 17 5 3 15 2 9 11 6 2 16 1 4 5 13 7

Note here that the full vectors of "equivalent" qx sequences that get recycled are essentially sampled without replacement.

Thanks!

Kyle
  • 83
  • 9
  • im not sure I understand exactly what the issue is. if bincode doesn't do what you want, maybe write your own function that does? Do you have any logic as to how you get from x and qx to q_i? I can't figure out what you're doing – morgan121 Jul 13 '19 at 05:10
  • @kmeanskeal, is there a reason there's no 8, 12, 14 or 18 in your solution for 'Under scenario (a) above', or should they be included (e.g. should the second 17 in the sequence be 18)? – hodgenovice Jul 13 '19 at 05:55
  • @RAB As I understand, qx are just some quantiles of x. q_i are indicies such that x[j] is between qx[q_i[j]] and qx[q_i[j] + 1]. – hodgenovice Jul 13 '19 at 06:12
  • @RAB, I would love to write my own function that does. I posted this in case somebody might have an idea of how to efficiently implement bincode or related functions to accomplish my goal. I'm not sure I can outline the "logic" for going from x and qx to q_i any better than I already have - I'm sorry you can't figure out what I'm doing. I am grouping data into quantiles, where there are repeats of quantiles, and need a way for "distributing" the data points over the duplicated quantiles. – Kyle Jul 13 '19 at 19:08
  • @hodgenovice, yes - the reason those values are not present is that there are no data that fall into the "bins" delineated by the quantiles at those indices. For example, for a value of 8, the data point would need to be > 3.5 and <= 3.62. You are correct in your understanding of my goal! – Kyle Jul 13 '19 at 19:15
  • @kmeanskeal okay, that makes sense. You do have a q_i value of 4, though which surely implies there must be an x such that it's > 0.00 and <= 0.18? Have I missed something again, or should it just be recycled up to 3? – hodgenovice Jul 14 '19 at 00:50
  • @hodgenovice, yes, thank you for pointing out that error! I've edited the question to show that only indices 1, 2, and 3 should be in the resulting `q_i`, recycled until all 0.0 values have been accounted for. – Kyle Jul 15 '19 at 18:07

2 Answers2

0

Okay, I've got some code which continues from yours to get to the final q_i under scenario a (recycling). I wish it was a bit prettier, but hope it helps anyway.

Note:
- This assumes length(x) > length(qx) > length(x)/2.
- In the explanation below the code, q_i refers to the value at the end of the question, before any recycling or replacement of values has taken place.

## Start off with the code provided in the question...
#  1. For each distinct q_i, calculate the number of occurrances, and how far we can recycle it
df <- data.frame(lower=sort(unique(q_i)), freq=as.integer(table(q_i)))
df$upper <- c(df$lower[-1] - df$lower[-nrow(df)], 1) + df$lower - 1
df$upper <- df$upper - as.numeric(df$upper > df$lower & qx[df$upper] < qx[df$upper + 1])

#  2. Identify when there's a (single) number we can't recycle, and identify which position it's in
#     e.g. is it the third time q_i == 10?
df$special_case <- rep(NA, nrow(df))
df$special_case[df$lower < df$upper] <- sapply(df$lower[df$lower < df$upper], function(low) {
                                        bin <- x[q_i==low]
                                        if(length(unique(bin)) > 1) {
                                          return(match(min(bin), bin))} 
                                        else return(NA)})

# 3. For each row of df, get a vector of (possibly recycled) numbers
recycled <- apply(df, 1, function(x) {
  out <- rep(x["lower"]:x["upper"], length.out=x["freq"])

  # This part modifies the vector created to handle the 'special case'
  if(!is.na(x["special_case"])) {
    out[x["special_case"]] <- x["lower"]
    if(x["special_case"] < x["freq"]) {
      out[(x["special_case"]+1):x["freq"]] <- out[x["special_case"]:(x["freq"]-1)]
    }
  }
  return(out)
})

# 3b. Make this follow the same order as q_i
q_i_final <- unlist(recycled)[order(order(q_i))]

q_i_final
[1] 10  1 19 11  5 19 13 10 17 16 17  6  2 15  3  9 11  7  1 16  2  3  5 13  6

What's the basic idea?
For each value of q_i, we can fairly easily calculate the number we should recycle up to (if we should recycle at all). We can typically recycle up to one less than the next largest value of q_i. We can then use rep to create a recycled vector to replace what's in q_i e.g. to replace the four 10s with 10 11 10 11.

What else is there to consider?
This basic idea assumes that for each value of q_i, the corresponding value(s) of x can either be all recycled or all not. This is usually the case, but you can also have some value of q_i where all bar one can be recycled i.e. one k such that x[k] < qx[q_i[k]+1], but one or more j where q_i[j] = q_i[k] and also x[j] = qx[q_i[j]+1].

Such 'special' cases (although not present in the question data) should be identified, and care must be taken that this value is not also recycled.


Special case in more detail

  1. We can make some simple changes to the question data to create this case (see code below). Note that x[5] > x[12], but q_i[5] = q_i[12] = 4. Now, under the 'basic idea' described above, all values where q_i = 4 would be recycled, so we would have q_i_final[12] = 5. This is a problem because we want x[12] to be between qx[q_i_final[12]] and qx[q_i_final[12]+1], which isn't the case since it's strictly less than both. It turns out we are fine to recycle all the values where q_i = 4, except for x[12].

New Code:

# Code copied from question, changes as follows:
# x[12] changed from 3.5 to 3.4
# x[13] and x[21] changed from 0.0 to 10.0
x <- c(5.8,  0.0, 16.1,  5.8,  3.5, 13.8,  6.9,  5.8, 11.5,  9.2, 11.5,
       3.4,  10.0,  8.1,  0.0,  4.6,  5.8,  3.5,  0.0, 10.3,  10.0,  0.0,
       3.5, 6.9, 3.5)
pq <- seq(0, 1, length.out = 20)
qx <- quantile(x, pq)
q_i <- .bincode(x, qx, include.lowest = T, right=T)

q_i
[1]  8  1 19  8  4 19 12  8 17 14 17  4 15 13  1  8  8  4  1 16 15  1  4 12  4
hodgenovice
  • 624
  • 3
  • 14
  • thanks for your response! Your approach makes sense and is very helpful. I am confused about the special case though. Why do you think it would be inappropriate to recycle the `qx` indices for these cases? e.g., suppose you replace `x[5]` with `3.499` in my question. Then if k = 5 and j = 11, we would have this special case, right? Both `x[5]` and `x[11]` would fall in the same duplicate quantiles (that is, they could correctly be assigned to any of `qx[5:7]`), and you should recycle these `qx` indices as usual when changing `q_i[c(5, 11)`. – Kyle Jul 18 '19 at 20:31
  • @kmeanskeal - If `x[5]` was `3.499`, then as I understand, it can only be assigned to `qx[5]` since `qx[5]` < `x[5]` < `qx[6]`, but `x[5]` is not between `qx[6]` and `qx[7]`. Also, this wouldn't quite be the case I was talking about since if you only adjust `x[5]` then `q_i` (before recycling) gives different values for `x[5]` and the values where `x` = 3.5, so you can simply recycle the ones where `x` = 3.5, and leave the 3.499 case unrecycled. I've edited the question to try to explain what I mean for a similar example, though. – hodgenovice Jul 20 '19 at 01:55
  • whoops you're very right, didn't catch that because I was looking at rounded `qx` values! Thanks a ton for updating your answer to provide a good example of that special case. Apologies for taking so long to respond here. I understand the issue now, but I didn't think of this special case and how it would be treated. – Kyle Aug 09 '19 at 17:30
  • However, I would like to treat those data as equivalent for binning purposes because, if for example there were no duplicate quantiles but we had the same quantile values, those special cases (`x[12]` in your edit) and the larger values (`x[5]`) would correctly be assigned the same quantile. I've edited the question to mention this special case. I'm adding a modification of your code that works for me as an answer. – Kyle Aug 09 '19 at 17:31
0

This code is based off of @hodgenovice's answer, but does not consider the special case.

It has an additional condition that correctly recycles the values for the first sequence of duplicated quantiles. This was an error on my part in the question, I originally omitted a q_i of 4 from my desired answer, but it should be one of the indices recycled for data values assigned a q_i of 1 by .bincode().

df <- data.frame(lower=sort(unique(q_i)), freq=as.integer(table(q_i)))
df$upper <- c(df$lower[-1] - df$lower[-nrow(df)], 1) + df$lower - 1
# want to omit this adjustment if the first quantile is also the first
#   duplicate, to follow rule described in question edit
ub <- df$lower != 1
df$upper[ub] <- df$upper[ub] - as.numeric(df$upper[ub] > df$lower[ub] & 
                  qx[df$upper[ub]] < qx[df$upper[ub] + 1])

recycled <- apply(df, 1, function(x) {
  out <- rep(x["lower"]:x["upper"], length.out=x["freq"])

  return(out)
})

q_i_final <- unlist(recycled)[order(order(q_i))]
Kyle
  • 83
  • 9