# Time series summary

Stack Overflow Asked by naanan_ on January 5, 2022

I was trying to sum numbers whose time lag is 1. i.e. I would like to summarize the rows by adding the frequencies of values where the days differ only by a single day within a particular group. I used the lag function to get the diff, but not sure how to proceed from here.

df <- df %>%
group_by(group) %>%
mutate(diff = dt - lag(dt))

df[!is.na(df$diff) & df$diff > 1,]$diff <- NA  For ex:  group dt freq diff groupA 2016-03-21 1 NA groupA 2016-03-22 1 1 groupA 2016-03-23 1 1 groupA 2016-03-26 2 NA groupA 2016-03-28 1 NA groupA 2016-03-29 3 1 groupA 2016-03-30 3 1 groupA 2016-03-31 5 1 groupB 2016-04-01 1 NA groupB 2016-04-02 2 1  I need to group this into: group dt freq diff duration groupA 2016-03-21 1 NA 3 (1 + 1 + 1) groupA 2016-03-22 1 1 groupA 2016-03-23 1 1 groupA 2016-03-26 2 NA 2 groupA 2016-03-28 1 NA 12(1 + 3 + 3 + 5) groupA 2016-03-29 3 1 groupA 2016-03-30 3 1 groupA 2016-03-31 5 1 groupB 2016-04-01 1 NA 3(1 + 2) groupB 2016-04-02 2 1  Also referred to this, but cumulative does not work as I do not consider jumps more than a single day apart. Is looping in a custom function the only way? ## 2 Answers You can do it much easier with this approach (grouping rows with less.than 1 day difference); this will create a helper column gap which later will be used to sum the freq for consecutive days in the same group: library(dplyr) df %>% mutate(gap = cumsum(!c(TRUE, diff(as.Date(df$dt)) == 1)))  %>%
group_by(gap, group) %>%
mutate(duration = sum(freq, na.rm=TRUE)) %>%
ungroup %>% select(-gap) %>% as.data.frame

#            group         dt freq duration
#        1  groupA 2016-03-21    1        3
#        2  groupA 2016-03-22    1        3
#        3  groupA 2016-03-23    1        3
#        4  groupA 2016-03-26    2        2
#        5  groupA 2016-03-28    1       12
#        6  groupA 2016-03-29    3       12
#        7  groupA 2016-03-30    3       12
#        8  groupA 2016-03-31    5       12
#        9  groupB 2016-04-01    1        3
#        10 groupB 2016-04-02    2        3


Answered by M-- on January 5, 2022

Here is a tidyverse solution using dplyr::lead:

library(tidyverse);
df %>%
mutate(dt = as.POSIXct(dt)) %>%
group_by(group) %>%
mutate(
diff = pmin(c(1, diff(dt)), c(1, diff(lead(dt))), na.rm = T),
id = cumsum(c(TRUE, diff(diff) != 0) | diff > 1)) %>%
group_by(group, id) %>%
mutate(duration = sum(freq)) %>%
ungroup() %>%
select(-diff, -id)
## A tibble: 10 x 4
#   group  dt                   freq duration
#   <fct>  <dttm>              <int>    <int>
# 1 groupA 2016-03-21 00:00:00     1        3
# 2 groupA 2016-03-22 00:00:00     1        3
# 3 groupA 2016-03-23 00:00:00     1        3
# 4 groupA 2016-03-26 00:00:00     2        2
# 5 groupA 2016-03-28 00:00:00     1       12
# 6 groupA 2016-03-29 00:00:00     3       12
# 7 groupA 2016-03-30 00:00:00     3       12
# 8 groupA 2016-03-31 00:00:00     5       12
# 9 groupB 2016-04-01 00:00:00     1        3
#10 groupB 2016-04-02 00:00:00     2        3


Explanation: diff chooses the minimum difference between the preceding and following date. We then look for changes in diff, and create a new grouping vector id by which we calculate the summary metric sum(freq).

## Sample data

df <- read.table(text =
" group     dt           freq  diff
groupA    2016-03-21    1     NA
groupA    2016-03-22    1     1
groupA    2016-03-23    1     1
groupA    2016-03-26    2     NA
groupA    2016-03-28    1     NA
groupA    2016-03-29    3     1
groupA    2016-03-30    3     1
groupA    2016-03-31    5     1
groupB    2016-04-01    1     NA
groupB    2016-04-02    2     1 ", header = T)


## Update

# Sample data
" group     dt           freq  diff
groupA    2016-03-21    1     NA
groupA    2016-03-22    1     1
groupA    2016-03-23    1     1
groupA    2016-03-26    2     NA
groupA    2016-03-28    1     NA
groupA    2016-04-01    3     1
groupA    2016-04-02    3     1
groupA    2016-04-03    5     1
groupB    2016-04-01    1     NA
groupB    2016-04-02    2     1 ", header = T)

df %>%
mutate(dt = as.POSIXct(dt)) %>%
group_by(group) %>%
mutate(
diff = pmin(c(1, diff(dt)), c(1, diff(lead(dt))), na.rm = T),
id = cumsum(c(TRUE, diff(diff) != 0) | diff > 1)) %>%
group_by(group, id) %>%
mutate(duration = sum(freq)) %>%
ungroup() %>%
select(-diff, -id);
## A tibble: 10 x 4
#   group  dt                   freq duration
#   <fct>  <dttm>              <int>    <int>
# 1 groupA 2016-03-21 00:00:00     1        3
# 2 groupA 2016-03-22 00:00:00     1        3
# 3 groupA 2016-03-23 00:00:00     1        3
# 4 groupA 2016-03-26 00:00:00     2        2
# 5 groupA 2016-03-28 00:00:00     1        1
# 6 groupA 2016-04-01 00:00:00     3       11
# 7 groupA 2016-04-02 00:00:00     3       11
# 8 groupA 2016-04-03 00:00:00     5       11
# 9 groupB 2016-04-01 00:00:00     1        3
#10 groupB 2016-04-02 00:00:00     2        3


Answered by Maurits Evers on January 5, 2022