
I have the following data table of this structure:

+-----+-----+-------+
| id  | key | value |
+-----+-----+-------+
| 1   | A   | 1000  |
| 1   | A   | 2000  |
| 1   | B   | 2001  |
| 1   | A   | 2002  |
| 1   | A   | 2004  |
| 2   | B   | 2002  |
| 2   | C   | 2002  |
+-----+-----+-------+

My objective is to sum the values by id and key, but instead of simply grouping by id and key, I only want to sum the values when the (id, key) pair is the same across consecutive rows.

The result should be:

+-----+-----+-------+
| id  | key | value |
+-----+-----+-------+
| 1   | A   | 3000  |
| 1   | B   | 2001  |
| 1   | A   | 4006  |
| 2   | B   | 2002  |
| 2   | C   | 2002  |
+-----+-----+-------+

Is there any way to achieve this result?

Zeke

1 Answer


We can use rleid from data.table.

We convert the 'data.frame' to a 'data.table' with setDT. Using rleid on the 'key' column, we create another grouping column 'ind'. Grouped by 'ind' and 'id', we get the sum of 'value' and the first element of 'key'. Finally, we assign 'ind' to NULL as it is not needed in the expected output.

library(data.table)
setDT(df1)[, list(value = sum(value), key = key[1L]),
           by = .(ind = rleid(key), id)][, ind := NULL][]
#   id value key
#1:  1  3000   A
#2:  1  2001   B
#3:  1  4006   A
#4:  2  2002   B
#5:  2  2002   C
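To see what rleid is doing here, a small sketch using the 'key' values from the example data: it assigns a new integer id each time the value changes, so consecutive duplicates fall into the same group.

```r
library(data.table)

key <- c("A", "A", "B", "A", "A", "B", "C")

# rleid() increments the id whenever 'key' differs from the previous
# element, so consecutive duplicates share one group id
rleid(key)
# [1] 1 1 2 3 3 4 5
```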

Or, as @Frank suggested, we can use multiple columns within rleid, use that as the grouping variable, get the first element of the other variables and the sum of 'value', and assign the unwanted column to NULL to avoid copies.

setDT(df1)[, list(id=id[1L], key=key[1L], value=sum(value)) ,
        by = .(r=rleid(id, key))][, r:= NULL][]
#   id key value
#1:  1   A  3000
#2:  1   B  2001
#3:  1   A  4006
#4:  2   B  2002
#5:  2   C  2002
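A small sketch of why the two-column rleid matters: if 'id' changed while 'key' stayed the same across consecutive rows (a case not in the example data, constructed here for illustration), rleid(key) alone would pool them, while rleid(id, key) keeps them apart.

```r
library(data.table)

id  <- c(1, 2)
key <- c("A", "A")

rleid(key)      # 1 1 -- would wrongly pool the two rows
rleid(id, key)  # 1 2 -- a change in either column starts a new run
```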

Or we can use dplyr. We create the grouping variable 'ind' by comparing each element of 'key' with the previous one via lag, and then get the sum of 'value' and the first element of 'key' with summarise.

library(dplyr)
df1 %>%
     group_by(ind= cumsum(key!=lag(key, default=TRUE)), id) %>%
     summarise(value=sum(value), key=first(key)) %>% 
     ungroup() %>%
     select(-ind)
#  id value key
#1  1  3000   A
#2  1  2001   B
#3  1  4006   A
#4  2  2002   B
#5  2  2002   C
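The cumsum/lag trick builds the same run ids as rleid; a sketch with the example 'key' values (note: newer dplyr releases require `default` to have the same type as the vector, so an empty string is used here in place of the logical TRUE above):

```r
library(dplyr)

key <- c("A", "A", "B", "A", "A", "B", "C")

# TRUE wherever 'key' differs from the previous row; cumsum turns
# those change points into increasing group ids
cumsum(key != lag(key, default = ""))
# [1] 1 1 2 3 3 4 5
```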

NOTE: In the dplyr and data.table approaches, we can also place the 'key' column among the grouping variables and remove the key=key[1L] or key=first(key).


Or we can transform the dataset by creating the 'ind' column and use aggregate from base R to get the expected output.

df1 <- transform(df1, ind = cumsum(c(TRUE,head(key,-1)!=tail(key,-1))))
aggregate(value~., df1, FUN=sum)[-3]
#  id key value
#1  1   A  3000
#2  1   B  2001
#3  1   A  4006
#4  2   B  2002
#5  2   C  2002
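The head/tail comparison is the base-R analogue of lag; a sketch of the intermediate 'ind' column it builds for the example data:

```r
key <- c("A", "A", "B", "A", "A", "B", "C")

# head(key, -1) != tail(key, -1) compares each element with the next;
# prepending TRUE marks row 1 as the start of the first run
cumsum(c(TRUE, head(key, -1) != tail(key, -1)))
# [1] 1 1 2 3 3 4 5
```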

data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), key = c("A", 
"A", "B", "A", "A", "B", "C"), value = c(1000L, 2000L, 2001L, 
2002L, 2004L, 2002L, 2002L)), .Names = c("id", "key", "value"
), class = "data.frame", row.names = c(NA, -7L))
MichaelChirico
akrun
    You seem to be relying on sorting by `id`..? For what it's worth, `rleid` can take two arguments, getting around this concern: `DT[, .(id=id[1],key=key[1],value=sum(value)), by=.(r=rleid(id,key))][,!"r",with=FALSE]` – Frank Aug 31 '15 at 14:15
  • @Frank Thanks. I haven't benchmarked it, hopefully your method would be efficient. – akrun Aug 31 '15 at 14:23
    (Not very important:) I wasn't thinking about efficiency, just logical coherence -- translating "if the pair of id and key are the same for consecutive rows" into code. I'm not sure whether it should be more efficient. – Frank Aug 31 '15 at 14:33