-1

I have a data.frame with 32,000 entries. Here's a sample:

# df1
MINEVENT MAXEVENT      EVENTRANGE NUMEVENT cplt_flag
 2680001  2680051 2680001-2680051       51         0
 2680001  2680051 2680001-2680051       51         0
 2680001  2680051 2680001-2680051       51         0
 2680001  2680051 2680001-2680051       51         0
 2680001  2680051 2680001-2680051       51         0
 2680001  2680051 2680001-2680051       51         0

and another with a list of 157 values. Here's a sample:

# df2
source_id
   211535
   211535
   211535
   211536
   211536
   211536

I want to read from the source_id and test whether the value falls between MINEVENT and MAXEVENT. If TRUE then I want to enter a value 1 into cplt_flag, else 0.

I have a code using if-else statements but it runs super slow for the 32,000 entries. Also, I have been trying to use functions and apply functions but cannot get this to work.

I'm looking for an efficient way to get this done.

Arun
  • 116,683
  • 26
  • 284
  • 387
Jordan
  • 18
  • 2

2 Answers2

3

Your dataset doesn't really have any cases where there'll be a TRUE scenario.. But here's a solution using the new non-equi joins feature from the current development version of data.table, v1.9.7. See installation instructions here.

require(data.table) #v1.9.7+

setDT(df2)
setDT(df1)[df2, cplt_flag := 1, on = .(MINEVENT <= source_id, MAXEVENT >= source_id)]

For each row in df2, matching row indices from df1 are extracted where the condition provided to on= argument is satisfied. And on those row indices, cplt_flag is updated in-place with 1.

Arun
  • 116,683
  • 26
  • 284
  • 387
0

An alternative solution using a match.criterion function and one of the apply functions that should be faster than looping. I've added some additional rows of data to test (not exhaustive but illustrative):

df1 <- read.table(text = "
                  MINEVENT MAXEVENT      EVENTRANGE NUMEVENT cplt_flag
                  211535   211634  211535-211634        100         0
                  2680001  2680051 2680001-2680051       51         0
                  2680001  2680051 2680001-2680051       51         0
                  2680001  2680051 2680001-2680051       51         0
                  2680001  2680051 2680001-2680051       51         0
                  2680001  2680051 2680001-2680051       51         0
                  2680001  2680051 2680001-2680051       51         0
                  2680101  2680151 2680101-2680151       51         0", header = TRUE)

df2 <- read.table(text = "
                  source_id
                  211535
                  211535
                  211535
                  211536
                  211536
                  211536
                  2680051", header = TRUE)

match.criterion <- function(source.id, df1) {
  matches <- which(df1$MINEVENT <= source.id & source.id <= df1$MAXEVENT)
  df1$cplt_flag[matches] <<- 1
}

sapply(df2$source_id, match.criterion, df1 = df1)
print(df1)
##  MINEVENT MAXEVENT      EVENTRANGE NUMEVENT cplt_flag
##1   211535   211634   211535-211634      100         1
##2  2680001  2680051 2680001-2680051       51         1
##3  2680001  2680051 2680001-2680051       51         1
##4  2680001  2680051 2680001-2680051       51         1
##5  2680001  2680051 2680001-2680051       51         1
##6  2680001  2680051 2680001-2680051       51         1
##7  2680001  2680051 2680001-2680051       51         1
##8  2680101  2680151 2680101-2680151       51         0

Notes:

  1. The key here is to understand R's scoping rule. To modify a variable outside the scope of the function, use <<- instead of <-. See this for an explanation, and heed the warning concerning using <<-.

  2. This assumes that df1$cplt_flag is initially all zeroes as the match.criterion only sets the rows that match to 1. That is, the rows of df1 that do not match the criterion for each value of source_id is left alone.

Still another vectorizing solution that uses foreach instead of one of the apply functions is:

require(foreach)
foreach(source.id = df2$source_id) %do% match.criterion(source.id, df1)
Community
  • 1
  • 1
aichao
  • 7,375
  • 3
  • 16
  • 18