I have a for loop that is quite slow when I apply it to a dataset with 100k+ observations. The code reads the country assigned to each ID from one column (df$country) (e.g., id == 1 and country == "Japan") and sets the indicator column whose name matches that country (e.g., the column named "Japan") to 1 in that row.
Sample data (dput()):
structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The code is below:
# Vector of indicator column names, all named after countries (e.g., "Japan").
cols <- names(df)[3:5]

# For each ID and each column name: if the row's id is j and its
# country matches the column name, set that row's entry to 1.
for (j in df$id) {
  for (c in cols) {
    df[df$id == j & df$country == c, c] <- 1
  }
}
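For reference, this is roughly how I generate a larger test set to reproduce the slowdown (the size and country names here are illustrative placeholders, not my real data):

```r
# Hypothetical scaled-up version of the sample data: 100k rows,
# three countries, all indicator columns initialized to 0.
set.seed(1)
n <- 100000
countries <- c("USA", "Japan", "Germany")
df <- data.frame(id      = 1:n,
                 country = sample(countries, n, replace = TRUE),
                 USA     = 0,
                 Japan   = 0,
                 Germany = 0,
                 stringsAsFactors = FALSE)
```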
The code is far too slow: it has been running for 20 minutes on the 100k-observation dataset and still hasn't finished. Is there any way I can speed this up? Thank you!