TransWikia.com

Time series summary

Stack Overflow Asked by naanan_ on January 5, 2022

I am trying to sum numbers whose time lag is 1, i.e. I would like to summarize the rows by adding up the frequencies of rows whose dates differ by exactly one day within a particular group. I used the lag function to get the difference, but I am not sure how to proceed from here.

df <- df %>% 
  group_by(group) %>% 
  mutate(diff = dt - lag(dt))

df[!is.na(df$diff) & df$diff > 1,]$diff <- NA

For example:

 group     dt           freq  diff  
 groupA    2016-03-21    1     NA    
 groupA    2016-03-22    1     1     
 groupA    2016-03-23    1     1     
 groupA    2016-03-26    2     NA     
 groupA    2016-03-28    1     NA     
 groupA    2016-03-29    3     1     
 groupA    2016-03-30    3     1     
 groupA    2016-03-31    5     1     
 groupB    2016-04-01    1     NA      
 groupB    2016-04-02    2     1 

I need to group this into:

group   dt          freq  diff  duration
groupA  2016-03-21    1     NA    3 (1 + 1 + 1)
groupA  2016-03-22    1     1
groupA  2016-03-23    1     1
groupA  2016-03-26    2     NA    2
groupA  2016-03-28    1     NA    12 (1 + 3 + 3 + 5)
groupA  2016-03-29    3     1
groupA  2016-03-30    3     1
groupA  2016-03-31    5     1
groupB  2016-04-01    1     NA    3 (1 + 2)
groupB  2016-04-02    2     1

I also referred to this, but a cumulative sum does not work because I do not want to sum across jumps of more than a single day. Is looping in a custom function the only way?

2 Answers

You can do this more easily by grouping rows whose dates are exactly one day apart. The code below creates a helper column gap, which is then used to sum freq over consecutive days within the same group:

library(dplyr)

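df %>% 
    # gap increases whenever a date is not exactly one day after the previous row
    mutate(gap = cumsum(!c(TRUE, diff(as.Date(dt)) == 1))) %>% 
    # grouping by group as well keeps a run from spanning two groups
    group_by(gap, group) %>% 
    mutate(duration = sum(freq, na.rm = TRUE)) %>% 
    ungroup() %>% select(-gap) %>% as.data.frame()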

#            group         dt freq duration
#        1  groupA 2016-03-21    1        3
#        2  groupA 2016-03-22    1        3
#        3  groupA 2016-03-23    1        3
#        4  groupA 2016-03-26    2        2
#        5  groupA 2016-03-28    1       12
#        6  groupA 2016-03-29    3       12
#        7  groupA 2016-03-30    3       12
#        8  groupA 2016-03-31    5       12
#        9  groupB 2016-04-01    1        3
#        10 groupB 2016-04-02    2        3
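
To see what the helper column looks like before it is dropped, here is a minimal base R sketch of the same run-labelling idea. It assumes the sample data from the question, rebuilt by hand with dt as a Date column, and uses ave() in place of the dplyr grouping. Note that gap starts at 1 here rather than 0; only the changes in the label matter.

```r
# Sample data from the question, with dt parsed as Date
df <- data.frame(
  group = c(rep("groupA", 8), rep("groupB", 2)),
  dt = as.Date(c("2016-03-21", "2016-03-22", "2016-03-23", "2016-03-26",
                 "2016-03-28", "2016-03-29", "2016-03-30", "2016-03-31",
                 "2016-04-01", "2016-04-02")),
  freq = c(1, 1, 1, 2, 1, 3, 3, 5, 1, 2)
)

# TRUE where a row does not continue the previous row's run,
# i.e. its date is not exactly one day after the previous date
new_run <- c(TRUE, as.numeric(diff(df$dt)) != 1)

# The cumulative sum of the breaks gives one label per run
df$gap <- cumsum(new_run)
# gap: 1 1 1 2 3 3 3 3 3 3

# Sum freq within each (group, gap) run and broadcast it back to every row
df$duration <- ave(df$freq, df$group, df$gap, FUN = sum)
# duration: 3 3 3 2 12 12 12 12 3 3

df
```

The last groupA run and the groupB run share gap = 3 because 2016-04-01 directly follows 2016-03-31; including group in the grouping is what keeps their sums separate.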

Answered by M-- on January 5, 2022

Here is a tidyverse solution using dplyr::lead:

library(tidyverse)
df %>%
    mutate(dt = as.POSIXct(dt)) %>%
    group_by(group) %>%
    mutate(
        # smaller of the gaps to the previous and to the next date
        diff = pmin(c(1, diff(dt)), c(1, diff(lead(dt))), na.rm = TRUE),
        # new id wherever diff changes or is larger than one day
        id = cumsum(c(TRUE, diff(diff) != 0) | diff > 1)) %>%
    group_by(group, id) %>%
    mutate(duration = sum(freq)) %>%
    ungroup() %>%
    select(-diff, -id)
## A tibble: 10 x 4
#   group  dt                   freq duration
#   <fct>  <dttm>              <int>    <int>
# 1 groupA 2016-03-21 00:00:00     1        3
# 2 groupA 2016-03-22 00:00:00     1        3
# 3 groupA 2016-03-23 00:00:00     1        3
# 4 groupA 2016-03-26 00:00:00     2        2
# 5 groupA 2016-03-28 00:00:00     1       12
# 6 groupA 2016-03-29 00:00:00     3       12
# 7 groupA 2016-03-30 00:00:00     3       12
# 8 groupA 2016-03-31 00:00:00     5       12
# 9 groupB 2016-04-01 00:00:00     1        3
#10 groupB 2016-04-02 00:00:00     2        3

Explanation: the helper column diff holds, for each row, the smaller of the gaps to the preceding and to the following date. We then look for changes in diff (or values greater than 1) and create a new grouping vector id, by which we calculate the summary metric sum(freq).
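
To make the explanation concrete, the sketch below rebuilds the two helper columns in base R for the groupA dates only. Two things are assumptions made for illustration: lead_vec is a hypothetical stand-in for dplyr::lead, and the dates are parsed with as.Date instead of as.POSIXct so that the differences are plain day counts.

```r
# groupA dates from the sample data, parsed as Date so diff() is in days
dt <- as.Date(c("2016-03-21", "2016-03-22", "2016-03-23", "2016-03-26",
                "2016-03-28", "2016-03-29", "2016-03-30", "2016-03-31"))

# Hypothetical base R stand-in for dplyr::lead(): shift left, pad with NA
lead_vec <- function(x) c(x[-1], NA)

# Gap to the previous date; the first element is the placeholder 1
back <- c(1, as.numeric(diff(dt)))
# Gap to the next date from row 2 onwards; the first element is again the
# placeholder 1 and the last element is NA (no following date)
forward <- c(1, as.numeric(diff(lead_vec(dt))))

# Smaller of the two gaps; na.rm = TRUE ignores the NA in the last row
d <- pmin(back, forward, na.rm = TRUE)
# d: 1 1 1 2 1 1 1 1

# Start a new id wherever d changes or d itself exceeds one day
id <- cumsum(c(TRUE, diff(d) != 0) | d > 1)
# id: 1 1 1 2 3 3 3 3

data.frame(dt, back, forward, d, id)
```

One thing worth checking on your own data: d only records whether a row has a neighbour one day away, so two multi-day runs that follow each other with no isolated date in between (for example 2016-03-21, 2016-03-22, 2016-03-25, 2016-03-26) get d = 1 in every row and end up sharing a single id.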


Sample data

df <- read.table(text =
    " group     dt           freq  diff
 groupA    2016-03-21    1     NA
 groupA    2016-03-22    1     1
 groupA    2016-03-23    1     1
 groupA    2016-03-26    2     NA
 groupA    2016-03-28    1     NA
 groupA    2016-03-29    3     1
 groupA    2016-03-30    3     1
 groupA    2016-03-31    5     1
 groupB    2016-04-01    1     NA
 groupB    2016-04-02    2     1 ", header = TRUE)

Update

For your second example:

# Sample data
df <- read.table(text =
" group     dt           freq  diff
groupA    2016-03-21    1     NA
groupA    2016-03-22    1     1
groupA    2016-03-23    1     1
groupA    2016-03-26    2     NA
groupA    2016-03-28    1     NA
groupA    2016-04-01    3     1
groupA    2016-04-02    3     1
groupA    2016-04-03    5     1
groupB    2016-04-01    1     NA
groupB    2016-04-02    2     1 ", header = TRUE)

df %>%
    mutate(dt = as.POSIXct(dt)) %>%
    group_by(group) %>%
    mutate(
        diff = pmin(c(1, diff(dt)), c(1, diff(lead(dt))), na.rm = TRUE),
        id = cumsum(c(TRUE, diff(diff) != 0) | diff > 1)) %>%
    group_by(group, id) %>%
    mutate(duration = sum(freq)) %>%
    ungroup() %>%
    select(-diff, -id)
## A tibble: 10 x 4
#   group  dt                   freq duration
#   <fct>  <dttm>              <int>    <int>
# 1 groupA 2016-03-21 00:00:00     1        3
# 2 groupA 2016-03-22 00:00:00     1        3
# 3 groupA 2016-03-23 00:00:00     1        3
# 4 groupA 2016-03-26 00:00:00     2        2
# 5 groupA 2016-03-28 00:00:00     1        1
# 6 groupA 2016-04-01 00:00:00     3       11
# 7 groupA 2016-04-02 00:00:00     3       11
# 8 groupA 2016-04-03 00:00:00     5       11
# 9 groupB 2016-04-01 00:00:00     1        3
#10 groupB 2016-04-02 00:00:00     2        3        

Answered by Maurits Evers on January 5, 2022
