
I have the following data table of this structure:

+-----+-----+-------+
| id  | key | value |
+-----+-----+-------+
| 1   | A   | 1000  |
| 1   | A   | 2000  |
| 1   | B   | 2001  |
| 1   | A   | 2002  |
| 1   | A   | 2004  |
| 2   | B   | 2002  |
| 2   | C   | 2002  |
+-----+-----+-------+

My objective is to sum the values by id and key, but instead of simply grouping by id and key, I only want to sum the values when the (id, key) pair is the same across consecutive rows.

The result should be:

+-----+-----+-------+
| id  | key | value |
+-----+-----+-------+
| 1   | A   | 3000  |
| 1   | B   | 2001  |
| 1   | A   | 4006  |
| 2   | B   | 2002  |
| 2   | C   | 2002  |
+-----+-----+-------+

Is there any way to achieve this result?

Zeke

1 Answer


We can use rleid from data.table.

We convert the 'data.frame' to a 'data.table' with setDT. Using rleid on the 'key' column, we create another grouping column 'ind'. Grouped by 'ind' and 'id', we get the sum of 'value' and the first element of 'key'. Finally, we assign 'ind' to NULL as it is not needed in the expected output.

library(data.table)
setDT(df1)[, list(value = sum(value), key = key[1L]),
           by = .(ind = rleid(key), id)][, ind := NULL][]
#   id value key
#1:  1  3000   A
#2:  1  2001   B
#3:  1  4006   A
#4:  2  2002   B
#5:  2  2002   C
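To see what rleid is doing here, a small sketch using the 'key' values from the example data: it assigns a new integer id each time the value changes, so consecutive duplicates fall into the same group.

```r
library(data.table)

key <- c("A", "A", "B", "A", "A", "B", "C")

# rleid() increments the id whenever 'key' differs from the previous
# element, so consecutive duplicates share one group id
rleid(key)
# [1] 1 1 2 3 3 4 5
```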

Or, as @Frank suggested, we can use multiple columns within rleid, use that as the grouping variable, get the first element of the other variables and the sum of 'value', and assign the unwanted column to NULL to avoid copies.

setDT(df1)[, list(id=id[1L], key=key[1L], value=sum(value)) ,
        by = .(r=rleid(id, key))][, r:= NULL][]
#   id key value
#1:  1   A  3000
#2:  1   B  2001
#3:  1   A  4006
#4:  2   B  2002
#5:  2   C  2002
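A small sketch of why the two-column rleid matters: if 'id' changed while 'key' stayed the same across consecutive rows (a case not in the example data, constructed here for illustration), rleid(key) alone would pool them, while rleid(id, key) keeps them apart.

```r
library(data.table)

id  <- c(1, 2)
key <- c("A", "A")

rleid(key)      # 1 1 -- would wrongly pool the two rows
rleid(id, key)  # 1 2 -- a change in either column starts a new run
```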

Or we can use dplyr. We create the grouping variable 'ind' by comparing each element of 'key' with the previous one via lag, and then get the sum of 'value' and the first element of 'key' with summarise.

library(dplyr)
df1 %>%
     group_by(ind= cumsum(key!=lag(key, default=TRUE)), id) %>%
     summarise(value=sum(value), key=first(key)) %>% 
     ungroup() %>%
     select(-ind)
#  id value key
#1  1  3000   A
#2  1  2001   B
#3  1  4006   A
#4  2  2002   B
#5  2  2002   C
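The cumsum/lag trick builds the same run ids as rleid; a sketch with the example 'key' values (note: newer dplyr releases require `default` to have the same type as the vector, so an empty string is used here in place of the logical TRUE above):

```r
library(dplyr)

key <- c("A", "A", "B", "A", "A", "B", "C")

# TRUE wherever 'key' differs from the previous row; cumsum turns
# those change points into increasing group ids
cumsum(key != lag(key, default = ""))
# [1] 1 1 2 3 3 4 5
```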

NOTE: In the dplyr and data.table approaches, we can also place the 'key' column among the grouping variables and remove the key=key[1L] or key=first(key).


Or we can transform the dataset by creating the 'ind' column and use aggregate from base R to get the expected output.

df1 <- transform(df1, ind = cumsum(c(TRUE,head(key,-1)!=tail(key,-1))))
aggregate(value~., df1, FUN=sum)[-3]
#  id key value
#1  1   A  3000
#2  1   B  2001
#3  1   A  4006
#4  2   B  2002
#5  2   C  2002
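The head/tail comparison is the base-R analogue of lag; a sketch of the intermediate 'ind' column it builds for the example data:

```r
key <- c("A", "A", "B", "A", "A", "B", "C")

# head(key, -1) != tail(key, -1) compares each element with the next;
# prepending TRUE marks row 1 as the start of the first run
cumsum(c(TRUE, head(key, -1) != tail(key, -1)))
# [1] 1 1 2 3 3 4 5
```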

data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), key = c("A", 
"A", "B", "A", "A", "B", "C"), value = c(1000L, 2000L, 2001L, 
2002L, 2004L, 2002L, 2002L)), .Names = c("id", "key", "value"
), class = "data.frame", row.names = c(NA, -7L))
MichaelChirico
akrun
    You seem to be relying on sorting by `id`..? For what it's worth, `rleid` can take two arguments, getting around this concern: `DT[, .(id=id[1],key=key[1],value=sum(value)), by=.(r=rleid(id,key))][,!"r",with=FALSE]` – Frank Aug 31 '15 at 14:15
  • @Frank Thanks. I haven't benchmarked it, hopefully your method would be efficient. – akrun Aug 31 '15 at 14:23
    (Not very important:) I wasn't thinking about efficiency, just logical coherence -- translating "if the pair of id and key are the same for consecutive rows" into code. I'm not sure whether it should be more efficient. – Frank Aug 31 '15 at 14:33