
This is the code I am running in R:

options(stringsAsFactors=FALSE)
x=read.table("sample.txt")
y=read.table("comp.txt")
nrowx=nrow(x)
nrowy=nrow(y)
for(i in 1:nrowx)
{
    flag=0
    for(j in 1:nrowy)
    {
        if(x[i,2]==y[j,2])      
        {
            x[i,2]=y[j,1]
            flag=1
            break
        }
    }
    if(flag==0)
        x[i,]=NA
}

Here x has about 2,000,000 rows while y has around 2,500 rows. The code above takes around 1 minute to process 25 rows of x.

A few lines of the file read into x:

"X1" "X2"
"1" 53 "all.downtown@enron.com"
"2" 54 "all.enron-worldwide@enron.com"
"3" 55 "all.worldwide@enron.com"
"4" 56 "all_enron_north.america@enron.com"
"5" 56 "ec.communications@enron.com"
"6" 57 "charlotte@wptf.org"
"7" 58 "sap.mailout@enron.com"
"8" 59 "robert.badeer@enron.com"
"9" 60 "tim.belden@enron.com"
"10" 60 "robert.badeer@enron.com"
"11" 60 "jeff.richter@enron.com"
"12" 60 "valarie.sabo@enron.com"
"13" 60 "carla.hoffman@enron.com"
"14" 60 "murray.o neil@enron.com"
"15" 60 "chris.stokley@enron.com"

A few lines of the file read into y:

"X1" "X2"
"1" 1 "jeff.dasovich@enron.com"
"2" 2 "kay.mann@enron.com"
"3" 3 "sara.shackleton@enron.com"
"4" 4 "tana.jones@enron.com"
"5" 5 "vince.kaminski@enron.com"
"6" 6 "pete.davis@enron.com"
"7" 7 "chris.germany@enron.com"
"8" 8 "matthew.lenhart@enron.com"
"9" 9 "debra.perlingiere@enron.com"
"10" 10 "mark.taylor@enron.com"
"11" 11 "gerald.nemec@enron.com"
"12" 12 "richard.sanders@enron.com"
"13" 13 "james.steffes@enron.com"
"14" 14 "steven.kean@enron.com"
"15" 15 "susan.scott@enron.com"

Please suggest some alternative method to speed up the execution. Thanks! :)

phoenix
  • What are you trying to do? matching 2nd columns of dataframes and if matches replace by 1st col? can you also give us those 25 (or even 10) lines of x and y? – Ananta Oct 05 '13 at 06:50
  • Also, the unit *lakh* (100,000) is relatively unknown outside of South Asia. – nograpes Oct 05 '13 at 06:55
  • @Ajanta: I have added few lines of the files on which I am executing the code. Yes, I am matching 2nd columns of dataframes and if matches replace the 2nd col of file read in 'x' by 1st col of file read in 'y'. – phoenix Oct 05 '13 at 07:46
  • Running it in a function rather than globally will already speed it up. – PascalVKooten Oct 05 '13 at 09:15
  • Also, matrices tend to be faster than data frames. – PascalVKooten Oct 05 '13 at 09:18
  • @dualinity: I tried converting to a matrix using x=data.matrix(x), and the same for y, but I am getting the warning "In data.matrix(x) : NAs introduced by coercion", and a similar message for y. The 2nd columns in both x and y are getting replaced by NA. – phoenix Oct 05 '13 at 09:43
  • @phoenix Could you see if the example below works? – PascalVKooten Oct 05 '13 at 09:47

1 Answer


If I understand correctly:

If an email address in x also exists in y, you want to take the number that belongs to that address in y and replace x's email address with that number?

Possible end results of rows in x:

"100" 60 11
"101" NA NA

So perhaps try this:

x <- as.matrix(x)
y <- as.matrix(y)

# This matcher is about 2 times faster than the built-in match() function.
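matcher <- function(i) {
  w <- which(x[i,2] == y[,2])
  # The condition is a single TRUE/FALSE, so plain if/else is enough.
  if (length(w) > 0) y[w[1],1] else NA
}

# Use the actual row count rather than a hard-coded 2000000,
# otherwise the result will not have the same length as x[,2].
x[,2] <- sapply(seq_len(nrow(x)), matcher)
x[is.na(x[,2]), 1] <- NA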

Perhaps test it first on, say, 100,000 rows to see how fast it is:

sapply(1:100000, function(i) matcher(i))

This should be faster because there is no loop nested inside another loop: each row of x is compared against the whole email column of y in a single vectorized comparison, and which() then picks out the first match.
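For comparison, the whole lookup can also be written without any per-row function. The following is only a minimal sketch, not a benchmarked replacement: it assumes x and y are still data frames with the number in column X1 and the email in column X2, and it uses a few made-up rows (example.com addresses) in place of the real files.

```r
# Made-up stand-ins for the real data, same two-column layout.
x <- data.frame(X1 = c(59, 60, 60),
                X2 = c("c@example.com", "zz@example.com", "b@example.com"),
                stringsAsFactors = FALSE)
y <- data.frame(X1 = c(1, 2, 3),
                X2 = c("a@example.com", "b@example.com", "c@example.com"),
                stringsAsFactors = FALSE)

# For every email in x, match() gives the row of its first occurrence in y,
# or NA when the email does not occur in y at all.
idx <- match(x$X2, y$X2)

# Replace the email by y's number; rows without a match become NA in both columns.
x$X2 <- y$X1[idx]
x$X1[is.na(idx)] <- NA
```

Working on the data frame columns directly also avoids as.matrix(), which turns a data frame with mixed numeric and character columns into an all-character matrix.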

Bonus

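Since this is easy to parallelize, consider this (if your machine has 4 cores):

myParallel <- function(cores, x, y) {
  require(parallel)
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl))  # shut the workers down when done
  # Workers start with empty workspaces, so copy x, y and matcher to them.
  clusterExport(cl, c("x", "y", "matcher"), envir = environment())
  unlist(parSapply(cl, seq_len(nrow(x)), matcher))
}
x[,2] <- myParallel(cores=4, x, y)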

It might just allow you to do this under 2 minutes rather than the current 5m30s!

PascalVKooten
  • Thanks a lot Dualinity! It worked real fast!! :) – phoenix Oct 05 '13 at 09:53
  • And did you confirm it worked? How long did it take? – PascalVKooten Oct 05 '13 at 09:54
  • According to my estimation it would still take 20 minutes? – PascalVKooten Oct 05 '13 at 09:58
  • 13 minutes with the latest version (that would be about a 6000 times speed up) – PascalVKooten Oct 05 '13 at 10:01
  • Yeah! I was able to process 200000 entries of x in around 25 seconds!! – phoenix Oct 05 '13 at 10:04
  • And you verified the result? I process 10000 entries in 4 seconds, using 2.5ghz. It seems that it goes 3 times faster for you? I am still a bit suspicious... – PascalVKooten Oct 05 '13 at 10:08
  • I just ran the code on the complete dataset. It took me 5 mins 30 secs to process the whole thing using 2.6GHz!! To be exact, x had 2064441 entries and y had 2359. – phoenix Oct 05 '13 at 10:14
  • How do I return multiple values from the function? In another code, if match occurs, I need to return values of 4 columns corresponding to the row that matches. – phoenix Oct 06 '13 at 05:58
  • `ifelse(length(w) > 0, y[w[1], ], NA)` that is, removing the index for column 1, returning all values of `y`. This will also mean that if it is not there, then rep(NA, 4) (if you have 4 columns). – PascalVKooten Oct 06 '13 at 07:42
  • @phoenix So I think you mean: `ifelse(length(w) > 0, y[w[1], ], rep(NA, 4))` (if there are 4 columns) – PascalVKooten Oct 06 '13 at 07:43
  • I had tried that, but only the value of 1st column is returned. The same is copy pasted in all 4 columns of x. – phoenix Oct 06 '13 at 08:34
  • Did you also change how to store it? `x[,1:4] <- sapply(1:2000000, function(i) matcher(i))`? instead of storing it into `x[,2]` – PascalVKooten Oct 06 '13 at 08:52
  • Though it probably won't work, as it stores them byrow instead of bycolumn. – PascalVKooten Oct 06 '13 at 08:55
  • yes, I did that. the same content in all 4 columns for each row! I guess the only option is to run this code 4 times, returning one of the 4 columns each time. Or to concatenate the contents of the 4 columns and then return it. Then do strsplit. – phoenix Oct 06 '13 at 09:04
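
A note on the multi-column question in the comments above: ifelse() returns a result shaped like its test, and the test `length(w) > 0` has length 1, so only the first value of the row comes back. One way around it, sketched here under assumptions rather than tested on the real data, is to index the rows of y with match(). The four-column table, its column names, and the `emails` vector are made up for illustration.

```r
# Made-up lookup table: four value columns plus the email key.
y <- data.frame(a = 1:3, b = c(10, 20, 30), c = c("p", "q", "r"),
                d = c(TRUE, FALSE, TRUE),
                email = c("u1@example.com", "u2@example.com", "u3@example.com"),
                stringsAsFactors = FALSE)
emails <- c("u3@example.com", "nobody@example.com", "u1@example.com")

# ifelse() with a length-1 test keeps only the first element of `yes`.
first_only <- ifelse(TRUE, c(1, 2, 3, 4), NA)

# Row indexing by match() keeps all four columns;
# an NA index (no match) yields a row filled with NA.
idx <- match(emails, y$email)
res <- y[idx, 1:4]
rownames(res) <- NULL
```

Here `res` has one row per looked-up email, so it could be assigned to four columns of a data frame with the same number of rows in a single step instead of running the lookup four times.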