I have a database of tweets which I am currently downloading. I want to assign a factor to each tweet based on its timestamp. However, this problem has turned out to be more challenging than I expected.
My example looks like this:
library(tidyverse)
library(lubridate)
Creating the boundaries:
start_time <- ymd_hms("2017-03-09 9:30:00", tz = "EST")
end_time <- start_time + days(1)
# convert both boundaries to Unix seconds
start_time <- as.numeric(start_time)
end_time <- as.numeric(end_time)
Creating the first table, which stands in for the table of tweets. On my PC, one day amounts to around 1M tweets, with around 1700 distinct timestamps:
example_times <- sample(start_time:end_time, 15)
example_table <- data.frame(unix_ts = rep(example_times, 200))
# placeholder tweet text: one Roman numeral per row
example_table$text <- as.character(as.roman(seq_len(nrow(example_table))))
# convert Unix seconds back to POSIXct (origin comes from lubridate)
example_table$unix_ts <- as.POSIXct(example_table$unix_ts, origin = origin)
Creating the second table, which holds the break times and the factor that should be assigned to each tweet. At the moment I have only two classes, but I would like to add more in the future:
breaks<-c(1489069800, 1489071600, 1489073400, 1489075200, 1489077000, 1489078800,
1489080600, 1489082400, 1489084200, 1489086000, 1489087800, 1489089600,
1489091400, 1489093200, 1489156200)
classes<-c('DOWN', 'UP', 'UP', 'UP', 'UP', 'DOWN', 'UP', 'UP', 'UP', 'DOWN', 'DOWN', 'DOWN', 'UP', 'DOWN', 'UP')
key <- data.frame(breaks, classes, stringsAsFactors = FALSE)
key$breaks <- as.POSIXct(breaks, origin = origin)
# each row's interval runs from the previous break to its own break (first row is NA)
key <- key %>% mutate(intrvl = interval(lag(breaks), breaks))
My attempt at solving this problem looks like this:
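assign_group <- function(unix_time) {
  # keep the rows of key whose interval contains this timestamp
  result <- key %>%
    filter(unix_time %within% key$intrvl) %>%
    select(classes) %>%
    unlist
  names(result) <- NULL
  return(result)
}
# one full scan of key per tweet
sapply(example_table$unix_ts, assign_group)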
This example is small, so this solution runs quickly here, but it becomes unmanageable on a dataset of 1M tweets. Even though the dataset is big, there are only 1500 distinct timestamps that I need to classify using assign_group. Could you please suggest a faster solution?
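
For reference, here is a minimal sketch of one direction that might be faster. It is not a tested solution for the full dataset. It uses base R's findInterval() on the numeric break points instead of filtering key once per tweet, and it classifies each distinct timestamp only once before mapping the labels back with match(). The helper name assign_group_fast and the stand-in vector ts are hypothetical. The sketch also assumes that each class labels the half-open interval (previous break, own break], so a timestamp that falls exactly on a break gets a single label.

```r
# Break points (Unix seconds) and labels, copied from the example above.
breaks <- c(1489069800, 1489071600, 1489073400, 1489075200, 1489077000,
            1489078800, 1489080600, 1489082400, 1489084200, 1489086000,
            1489087800, 1489089600, 1489091400, 1489093200, 1489156200)
classes <- c('DOWN', 'UP', 'UP', 'UP', 'UP', 'DOWN', 'UP', 'UP', 'UP',
             'DOWN', 'DOWN', 'DOWN', 'UP', 'DOWN', 'UP')

# Hypothetical helper (not part of the original code).
# Assumption: classes[i] labels the half-open interval (breaks[i - 1], breaks[i]].
assign_group_fast <- function(unix_ts, breaks, classes) {
  # Number of breaks strictly below each timestamp (0 .. length(breaks)).
  idx <- findInterval(as.numeric(unix_ts), breaks, left.open = TRUE)
  # Timestamps at or before the first break, or after the last one, get NA.
  idx[idx == 0 | idx == length(breaks)] <- NA
  classes[idx + 1]
}

# Stand-in for example_table$unix_ts: 15 distinct timestamps, each repeated 200 times.
set.seed(1)
ts <- as.POSIXct(rep(sample(breaks[1]:breaks[15], 15), 200),
                 origin = "1970-01-01", tz = "UTC")

# Classify each distinct timestamp once, then map the labels back to every row.
uniq <- unique(ts)
labels <- assign_group_fast(uniq, breaks, classes)[match(ts, uniq)]
```

Because findInterval() is already vectorised, the unique()/match() step is optional and mainly reflects the fact that only a small number of distinct timestamps occur. Note that %within% treats intervals as closed at both ends, so the sketch can differ from assign_group for timestamps that sit exactly on a break, including the very first one, which gets NA here.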