0

I have a for loop that I'm trying to run that is quite slow when I apply it to a dataset with 100k+ observations. What this code does is uses information from one column (df$country) that describes a country assigned to a particular ID (e.g., ID == 1 and country == Japan), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1.

Sample data (dput()):

structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan",  "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0,  0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0,  0)), row.names = c(NA, 6L), class = "data.frame")

The code is below:

#Assign vector of column names of my dataframe, 
#all named after countries (i.e. "Japan"). 
cols <- names(df[3:5]) 

#For each ID, for each column name,
#if ID == j and country == column name,
#Change entry in this row under column name to be unity.

  for(j in df$id){
    for(c in cols){
      df[df$id == j & df$country == c, c] <- 1
    }}

The code is way too slow to handle. It's been running for 20 minutes on 100k observations and still has not been completed. Are there any ways I can speed up this process? Thank you!

user438383
  • 5,716
  • 8
  • 28
  • 43
ZZ Top
  • 93
  • 5
  • 2
    Please provide some sample data: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – harre Jul 07 '22 at 15:42
  • Here's the dput to a sample: `structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")`. Obviously, `cols <- names(df[3:5])` in this case. – ZZ Top Jul 07 '22 at 15:51
  • 2
    @ZZTop, it would be better if you edit your question and add the sample data to your question. – Ed_Gravy Jul 07 '22 at 16:01
  • 1
    never call a variable `c`... – s_baldur Jul 07 '22 at 16:09
  • 1
    Is your question "how to do one hot encoding in R"? https://stackoverflow.com/questions/48649443/how-to-one-hot-encode-several-categorical-variables-in-r – Jon Spring Jul 07 '22 at 16:18

2 Answers2

4

You could loop over the columns instead of the rows:

for (col in cols) df[[col]] = +(df$country == col)

#   id country USA Japan Germany
# 1  1     USA   1     0       0
# 2  2   Japan   0     1       0
# 3  3 Germany   0     0       1
# 4  4   Japan   0     1       0
# 5  5   Japan   0     1       0
# 6  6 Germany   0     0       1

Also R has a function (model.matrix) to do exactly this:

df[levels(factor(df$country))] = model.matrix(~country - 1, df)
s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • This approach works, however for cases where ID is repeated (i.e. ID = 1 in row 7 and row 8, but country = Japan in row 7 and country = Germany in row 8), the dummy variables show 1 under Japan in row 7 and 1 under Germany in row 8, but not 1 under Japan in row 8 and 1 under Germany in row 7. is there a way to fix this? – ZZ Top Jul 07 '22 at 16:48
  • Please help me! – ZZ Top Jul 07 '22 at 17:48
  • Not sure I understand how id is related – s_baldur Jul 07 '22 at 21:19
  • my real data has multiples of the same ID but with different values in "country". I.e., row 7 could have ID == 1 and country == Japan, and row 8 could have ID == 2 with country == Germany. I want (USA, Japan, Germany) to be a vector of ones and zeros denoting IDs that list those countries under the country column. See the below dput: `structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))` – ZZ Top Jul 07 '22 at 22:28
  • @ZZTop so one row can have more than 1 zero? – s_baldur Jul 08 '22 at 09:12
  • You could do: `df |> group_by(id) |> mutate(across(-c(country), ~ max(.)))` – harre Jul 08 '22 at 15:03
2

You could use pivot_wider to do it all in one go:

library(tidyverse)

df |>
  mutate(value = 1) |>
  pivot_wider(id,
              names_from = "country",
              values_fill = 0) |>
  select(-id)

Output:

# A tibble: 6 × 3
    USA Japan Germany
  <dbl> <dbl>   <dbl>
1     1     0       0
2     0     1       0
3     0     0       1
4     0     1       0
5     0     1       0
6     0     0       1

Data:

df <- as.data.frame(structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"))))
harre
  • 7,081
  • 2
  • 16
  • 28
  • I unfortunately don't think that I can use pivot wider. This is because I am using many dataframes and each dataframe has its own country list according to the ID variable. So if dataframe one includes countries (USA, Japan, Germany) but dataframe two includes countries (Georgia, Canada, Spain), then these dataset structures will not align. – ZZ Top Jul 07 '22 at 16:05
  • I don't get what structure you loose. You would probably want to combine the datasets before doing the transformation. If not, you'd combine the data frames with the dummies afterwards using e.g. `bind_rows()` (don't using column binding!) and will not loose the structure. Please update your question including all details to allow for the best suited answers. – harre Jul 07 '22 at 16:19
  • These datasets all are hundreds of thousands of observations, and there's 30 of them. The machine I'm using just can't handle all of those datasets to be merged together prior to pivot_wider. This is why pivot_wider won't work for me, since if I pivot on Data1 with country values (Germany, US, UK), then the columns will be different from pivoted Data2, with country values (Japan, Canada, Belgium). – ZZ Top Jul 07 '22 at 16:22
  • Please explain how the columns will be different in a way that isn't expected? – harre Jul 07 '22 at 16:26
  • Anyway, while my approach is faster than the loop, the `model.matrix` will be the fastest. – harre Jul 07 '22 at 16:31
  • Because, from what I understand of pivot_wider, pivot_wider in Data1 will add columns "Germany, US, UK" but will not add "Japan, Canada, Belgium." Similarly, pivot_wider in Data2 will add columns "Japan, Canada, Belgium" but not "Germany, US, UK". To try and avoid this pivoting issue, I simply made ~196 zero-columns in my datasets, one for each country name, from which I hope to extract data from df$country by df$id. Pivot_wider will simply add new columns instead of inserting values into my existing country columns, and so Data1 and Data2 will be different. – ZZ Top Jul 07 '22 at 16:33
  • My pivoting does exactly the same as the `model.matrix`. I.e. you'd do something like `df[levels(factor(df$country))] <- df |> mutate(value = 1) |> pivot_wider(id, names_from = "country", values_fill = 0) |> select(-id)` afterwards for your purpose. However, I don't exactly get the objective of your structure: If you're short of memory the last thing you want to do is to add 196 x 100.000 zeros. – harre Jul 07 '22 at 16:38