Reshaping data in R with "login" "logout" times

Question

I'm new to R, and am working on a side project for my own purposes. I have this data (reproducable dput of this is at the end of the question):

     X            datetime  user  state
1    1 2016-02-19 19:13:26 User1 joined
2    2 2016-02-19 19:21:18 User2 joined
3    3 2016-02-19 19:21:33 User1 joined
4    4 2016-02-19 19:35:38 User1 joined
5    5 2016-02-19 19:44:15 User1 joined
6    6 2016-02-19 19:48:55 User1 joined
7    7 2016-02-19 19:52:40 User1 joined
8    8 2016-02-19 19:53:15 User3 joined
9    9 2016-02-19 20:02:34 User3 joined
10  10 2016-02-19 20:13:48 User3 joined
19 637 2016-02-19 19:13:32 User1   left
20 638 2016-02-19 19:25:26 User1   left
21 639 2016-02-19 19:30:30 User2   left
22 640 2016-02-19 19:42:16 User1   left
23 641 2016-02-19 19:47:59 User1   left
24 642 2016-02-19 19:51:06 User1   left
25 643 2016-02-19 20:02:26 User3   left

I want it to look like this:

    user  joined                left
1   User1 2016-02-19 19:13:26   2016-02-19 19:13:32
2   User2 2016-02-19 19:21:18   2016-02-19 19:30:30
3   User3 2016-02-19 19:53:15   2016-02-19 20:02:26 
4   User1 2016-02-19 19:21:33   2016-02-19 19:25:26
.
.
.

I'm looking at tidyr as there's some reshaping involved obviously, but I can't wrap my head around what exactly needs to be done. Is this even possible (without looping/massive amounts of procedural code)? The problem I can't grasp how to get around is that there's no way to know that a particular "left" record should be joined to a particular "joined" record. Examples I can find all involve a static month or day over which other values are gathered. I should add that it's not necessarily guaranteed that all records are guaranteed to have a "left" value (a user might still be "joined").

Here's a sample of the data:

samp <- data.frame(
  X = c(
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
    15L, 16L, 17L, 18L, 637L, 638L, 639L, 640L, 641L, 642L, 643L,
    644L, 645L, 646L, 647L, 648L, 649L, 650L, 651L
  ),
  datetime = factor(c(
    "2016-02-19 19:13:26", "2016-02-19 19:21:18", "2016-02-19 19:21:33",
    "2016-02-19 19:35:38", "2016-02-19 19:44:15", "2016-02-19 19:48:55",
    "2016-02-19 19:52:40", "2016-02-19 19:53:15", "2016-02-19 20:02:34",
    "2016-02-19 20:13:48", "2016-02-19 20:49:31", "2016-02-19 20:59:58",
    "2016-02-19 21:06:20", "2016-02-19 21:11:15", "2016-02-19 21:11:22",
    "2016-02-19 22:05:18", "2016-02-19 22:05:47", "2016-02-19 22:30:30",
    "2016-02-19 19:13:32", "2016-02-19 19:25:26", "2016-02-19 19:30:30",
    "2016-02-19 19:42:16", "2016-02-19 19:47:59", "2016-02-19 19:51:06",
    "2016-02-19 20:02:26", "2016-02-19 20:13:38", "2016-02-19 20:42:27",
    "2016-02-19 20:48:22", "2016-02-19 21:10:43", "2016-02-19 21:11:13",
    "2016-02-19 21:17:33", "2016-02-19 22:02:45", "2016-02-19 22:05:37"
  )),
  user = factor(rep(
    c(
      "User1", "User2", "User1", "User3", "User4", "User1", "User4", "User3",
      "User1", "User2", "User1", "User3", "User1", "User4", "User1", "User4"
    ),
    c(
      1L, 1L, 5L, 4L, 1L, 2L, 3L, 1L, 2L, 1L, 3L, 3L, 1L, 1L, 2L,
      2L
    )
  )),
  state = factor(rep(c("joined", "left"), c(18L, 15L)))
)

Some more info: it's safe to assume there are corresponding "left" values for all but the last "joined" record (this isn't true in the example dataset. I cut down the real data into something smaller to post here). It seems to be that you could cut the dataset by user/status then colbind the date each user left to solve this problem. `ts<-spread(test, state, datetime)` gets the dataset prepared a lot of the way. — Tim Coker, Mar 11 '16 at 04:58
this is a basic reshape if you first make the user id unique by state then `reshape(samp, drop = 'X', dir = 'wide', idvar = 'user', timevar = 'state', v.names = 'datetime')` — rawr, Mar 11 '16 at 05:11

Arun · Answer 1 · 2017-04-06T20:16:20.790

Using rowid() from the data.table-package along with dcast:

require(data.table)
dcast(dt, user + rowid(user, state) ~ state, value.var="datetime")

#      user user_1              joined                left
#  1: User1      1 2016-02-19 19:13:26 2016-02-19 19:13:32
#  2: User1      2 2016-02-19 19:21:33 2016-02-19 19:25:26
#  3: User1      3 2016-02-19 19:35:38 2016-02-19 19:42:16
#  4: User1      4 2016-02-19 19:44:15 2016-02-19 19:47:59
#  5: User1      5 2016-02-19 19:48:55 2016-02-19 19:51:06
#  6: User1      6 2016-02-19 19:52:40                <NA>
#  7: User2      1 2016-02-19 19:21:18 2016-02-19 19:30:30
#  8: User3      1 2016-02-19 19:53:15 2016-02-19 20:02:26
#  9: User3      2 2016-02-19 20:02:34                <NA>
# 10: User3      3 2016-02-19 20:13:48                <NA>

SymbolixAU · Accepted Answer · 2016-03-11T05:08:00.060

We can make use of the order of "left" and "joined", and match when one follows the other for each user.

For this I'm going to use library(data.table)

library(data.table)
setDT(df)

## order the data by user and datetime
df <- df[order(user, datetime)]
## add an 'order' column, which is a sequence from 1 to lenght()  
## for each user
df[, order := seq(1:.N), by=user]

## split the left and joins
dt_left <- df[state == "left"]
dt_joined <- df[state == "joined"]

## assuming 'left' is after 'joined', shift the 'order' back for left
dt_left[, order := order - 1]

## join user an dorder (and subsetting relevant columns) 
## keeping when there's a 'joined' but not a 'left'
dt <- dt_left[, .(user, order, datetime)][dt_joined[, .(user, order, datetime)], on=c("user", "order"), nomatch=NA]

## rename columns
setnames(dt, c("datetime", "i.datetime"), c("left", "joined"))

     user order                left              joined
 1: User1     1 2016-02-19 19:13:32 2016-02-19 19:13:26
 2: User1     3 2016-02-19 19:25:26 2016-02-19 19:21:33
 3: User1     5 2016-02-19 19:42:16 2016-02-19 19:35:38
 4: User1     7 2016-02-19 19:47:59 2016-02-19 19:44:15
 5: User1     9 2016-02-19 19:51:06 2016-02-19 19:48:55
 6: User1    11 2016-02-19 20:48:22 2016-02-19 19:52:40
 7: User1    13 2016-02-19 21:11:13 2016-02-19 21:06:20
 8: User1    15 2016-02-19 21:17:33 2016-02-19 21:11:15
 9: User2     1 2016-02-19 19:30:30 2016-02-19 19:21:18
10: User3     1 2016-02-19 20:02:26 2016-02-19 19:53:15
11: User3     3 2016-02-19 20:13:38 2016-02-19 20:02:34
12: User3     5 2016-02-19 20:42:27 2016-02-19 20:13:48
13: User3     7                  NA 2016-02-19 20:49:31
14: User3     8                  NA 2016-02-19 22:30:30
15: User4     1 2016-02-19 21:10:43 2016-02-19 20:59:58
16: User4     3 2016-02-19 22:02:45 2016-02-19 21:11:22
17: User4     5 2016-02-19 22:05:37 2016-02-19 22:05:18
18: User4     7                  NA 2016-02-19 22:05:47

score 5 · Answer 3 · answered Mar 11 '16 at 05:12

Base version:

samp$count <- with(samp, ave(as.character(user),list(state,user),FUN=seq_along) )

out <- merge(
  samp[samp$state=="joined",c("user","datetime","count")],
  samp[samp$state=="left",c("user","datetime","count")],
  by=c("user","count"), all.x=TRUE
)

out[order(out$count),]

ytk · Answer 4 · 2016-03-11T05:40:26.170

3

Another way to do it:

library(tidyr)
df <- df %>% spread(state, datetime)

df_joined <- df[!is.na(df$joined), 2:3]
df_joined <- df_joined[with(df_joined, order(user, joined)), ]

df_left <- df[!is.na(df$left), c(2, 4)]
df_left <- df_left[with(df_left, order(user, left)), ]

merge(df_joined, df_left, all = TRUE, by = 'user')

edited Mar 11 '16 at 05:40

answered Mar 11 '16 at 05:27

ytk

2,787
4
27
42

A. Webb · Answer 5 · 2016-03-11T06:47:35.913

We need a sequence number that determines the order of datetime within each user+state group. The sequence number used here, in particular, it is a meaningful consecutive count of joined-[left] records in the reshaped data frame.

Using spread from tidyr

spread(within(samp[,-1],seq<-ave(as.numeric(datetime),user,state,FUN=order)),
  state,datetime)



    user seq              joined                left
1  User1   1 2016-02-19 19:13:26 2016-02-19 19:13:32
2  User1   2 2016-02-19 19:21:33 2016-02-19 19:25:26
3  User1   3 2016-02-19 19:35:38 2016-02-19 19:42:16
4  User1   4 2016-02-19 19:44:15 2016-02-19 19:47:59
5  User1   5 2016-02-19 19:48:55 2016-02-19 19:51:06
6  User1   6 2016-02-19 19:52:40 2016-02-19 20:48:22
7  User1   7 2016-02-19 21:06:20 2016-02-19 21:11:13
8  User1   8 2016-02-19 21:11:15 2016-02-19 21:17:33
9  User2   1 2016-02-19 19:21:18 2016-02-19 19:30:30
10 User3   1 2016-02-19 19:53:15 2016-02-19 20:02:26
11 User3   2 2016-02-19 20:02:34 2016-02-19 20:13:38
12 User3   3 2016-02-19 20:13:48 2016-02-19 20:42:27
13 User3   4 2016-02-19 20:49:31                <NA>
14 User3   5 2016-02-19 22:30:30                <NA>
15 User4   1 2016-02-19 20:59:58 2016-02-19 21:10:43
16 User4   2 2016-02-19 21:11:22 2016-02-19 22:02:45
17 User4   3 2016-02-19 22:05:18 2016-02-19 22:05:37
18 User4   4 2016-02-19 22:05:47                <NA>

This may also be written with dcast from reshape2

dcast(within(samp,seq<-ave(as.numeric(datetime),user,state,FUN=order)),
  user+seq~state, value.var="datetime")

score 1 · Answer 6 · answered Sep 13 '19 at 22:39

since tidyr 1.0.0 the following is possible :

suppressPackageStartupMessages(library(tidyverse))
pivot_wider(samp[-1], names_from = "state", values_from = "datetime", 
            values_fn = list(datetime = list)) %>%
  mutate(left = map2(left, lengths(joined),`length<-`)) %>%
  unchop(everything())

#> # A tibble: 18 x 3
#>   user  joined              left               
#>   <fct> <fct>               <fct>              
#>  1 User1 2016-02-19 19:13:26 2016-02-19 19:13:32
#>  2 User1 2016-02-19 19:21:33 2016-02-19 19:25:26
#>  3 User1 2016-02-19 19:35:38 2016-02-19 19:42:16
#>  4 User1 2016-02-19 19:44:15 2016-02-19 19:47:59
#>  5 User1 2016-02-19 19:48:55 2016-02-19 19:51:06
#>  6 User1 2016-02-19 19:52:40 2016-02-19 20:48:22
#>  7 User1 2016-02-19 21:06:20 2016-02-19 21:11:13
#>  8 User1 2016-02-19 21:11:15 2016-02-19 21:17:33
#>  9 User2 2016-02-19 19:21:18 2016-02-19 19:30:30
#> 10 User3 2016-02-19 19:53:15 2016-02-19 20:02:26
#> 11 User3 2016-02-19 20:02:34 2016-02-19 20:13:38
#> 12 User3 2016-02-19 20:13:48 2016-02-19 20:42:27
#> 13 User3 2016-02-19 20:49:31 NA                 
#> 14 User3 2016-02-19 22:30:30 NA                 
#> 15 User4 2016-02-19 20:59:58 2016-02-19 21:10:43
#> 16 User4 2016-02-19 21:11:22 2016-02-19 22:02:45
#> 17 User4 2016-02-19 22:05:18 2016-02-19 22:05:37
#> 18 User4 2016-02-19 22:05:47 NA

values_fn is set to store multiple values for a given user in a list
Because thes don't have the same length we complete the short ones with NAs using mutateand length<-
Then we unnest vertically by using unchop

Reshaping data in R with "login" "logout" times

6 Answers6

Linked

Related