Cross Validated
Asked by NclsK on November 2, 2021
I am trying to fit a linear model using "hour of the day" as a parameter. What I'm struggling with is that I've found two possible approaches for handling this:
Dummy encoding for every hour of the day
Encoding the hour as a cyclic variable via a sine/cosine transform
I don’t quite understand the use cases of both approaches and thus I am not certain which one will lead to a better outcome.
The data I'm using is from this Kaggle challenge. The goal is to predict NYC taxi fares. The given attributes are pickup and dropoff coordinates, pickup datetime, passenger count, and the fare amount.
I extracted the hour of the day to take possible congestion into consideration and am trying to incorporate it into my model. I should probably also mention that I'm fairly inexperienced.
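For reference, the extraction might look like this in R (a minimal sketch; the data frame `taxi` and the column names are placeholders, not my actual code):
taxi$pickup_datetime <- as.POSIXct(taxi$pickup_datetime, tz="UTC")  # parse the timestamp
taxi$hour <- as.integer(format(taxi$pickup_datetime, "%H"))         # hour of day, 0..23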
+1 to gunes' answer. Dummy coding will indeed disregard the distance between time points - the responses between two time points 1 hour apart will be more alike than between two time points 3 hours apart, and dummy coding completely discards this piece of information.
Dummy encoding fits a step-like time dependency: the response is flat for one hour, then suddenly jumps, and the jump is unconstrained except by what the data tell us (a consequence of the lack of proximity modeling). Both aspects are rarely realistic in practice; the first plot produced by the code at the end of this answer illustrates this step pattern.
Here is an additional aspect. If you bucketize your day into 24 hours, then you need to fit 23 parameters in addition to the intercept. This is a lot, and you will need a huge amount of data to reliably fit this without running afoul of the bias-variance tradeoff.
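To make this concrete, here is a minimal sketch of the dummy approach in R (assuming a data frame `taxi` with a response `fare` and an integer `hour` column in 0-23; these names are placeholders):
fit_dummy <- lm(fare ~ factor(hour), data=taxi)  # factor() creates the hourly dummies
length(coef(fit_dummy))                          # 24: intercept plus 23 hourly dummies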
An alternative would be to use a Fourier-type model with harmonics. For instance, assume your observation timestamp $t$ corresponds to a time of day $\tau(t)$ (so when going from $t$ to $\tau(t)$, we simply drop the day, month and year information from $t$). Then you can transform the time impact into sines and cosines:
$$ \sin\Big(2\pi k\frac{\tau(t)}{24}\Big), \quad \cos\Big(2\pi k\frac{\tau(t)}{24}\Big). $$
A simple model would go up to $k=3$:
$$ y_t = \beta_0 + \sum_{k=1}^3 \beta_k \sin\Big(2\pi k\frac{\tau(t)}{24}\Big) + \sum_{k=1}^3 \gamma_k \cos\Big(2\pi k\frac{\tau(t)}{24}\Big) + \text{other covariates} + \epsilon_t. $$
This already gives you a lot of flexibility at the cost of fitting only 6 parameters (plus the intercept), so your model will be far more stable. Also, you will get neither a constant response within each hour nor abrupt steps when a new hour starts. The code at the end of this answer generates some random examples of time courses this model can fit.
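A minimal sketch of fitting this harmonic regression in R (again with placeholder names: a data frame `taxi`, a response `fare`, and a time-of-day column `tau` in hours):
for (k in 1:3) {  # build the sine/cosine features for k = 1, 2, 3
  taxi[[paste0("s",k)]] <- sin(2*pi*k*taxi$tau/24)
  taxi[[paste0("c",k)]] <- cos(2*pi*k*taxi$tau/24)
}
fit_fourier <- lm(fare ~ s1+c1+s2+c2+s3+c3, data=taxi)  # 6 slopes plus intercept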
Of course, regardless of what choice you make, you should think about including any additional pieces of information you have (e.g., if all theaters and cinemas start or finish their shows at the same time, mark this with a dummy, because then you will indeed get a sharp step change, at least in the relevant districts). Also, the time response will certainly differ between weekdays and weekends, and likely also between Fridays and other weekdays, so include interactions between your time model and the day of week, as sketched below. Or look into models for multiple seasonalities to address this.
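For instance, a day-of-week interaction could be sketched like this (assuming a factor column `dow` in `taxi`; the harmonic columns are the ones constructed above):
fit_dow <- lm(fare ~ (s1+c1+s2+c2+s3+c3)*dow, data=taxi)  # a separate daily profile per weekday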
R code for my plots:
# Step function implied by hourly dummies: one random level per hour
par(mai=c(.8,.1,.1,.1))
plot(c(0,24),c(0,1),type="n",yaxt="n",xlab="Hour",ylab="")
lines(c(0,rep(1:23,each=2),24),rep(runif(24),each=2))
# Fourier basis with K=3 harmonics on a fine grid, then six random coefficient draws
tau <- seq(0,24,by=.001)
mm <- cbind(1,sin(2*pi*1*tau/24),sin(2*pi*2*tau/24),sin(2*pi*3*tau/24),
            cos(2*pi*1*tau/24),cos(2*pi*2*tau/24),cos(2*pi*3*tau/24))
par(mai=c(.8,.1,.1,.1),mfrow=c(3,2))
for ( ii in 1:6 ) plot(tau,(mm%*%runif(7,-1,1))[,1],yaxt="n",xlab="Hour",ylab="",type="l")
Answered by Stephan Kolassa on November 2, 2021
For a time series regression, simply adding hourly dummies $D_h$, $h = 0, \cdots, 23$, is the natural thing to do in most cases, i.e. fit the model $$ y_t = \beta_0 D_0 + \cdots + \beta_{23} D_{23} + \text{other covariates} + \epsilon_t. $$ As a modeler, you're simply saying that the dependent variable $y_t$ has an hour-dependent average $\beta_h$ at hour $h$, plus the effect of other covariates. Any hourly (additive) seasonality in the data would be picked up by this regression. (Alternatively, seasonality can be modeled multiplicatively by, say, a SARMAX-type model.)
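In R, this parameterization (one mean per hour, no separate intercept) could be sketched as follows, again with placeholder names `taxi`, `fare`, and `hour`:
fit_hourly <- lm(fare ~ 0 + factor(hour), data=taxi)  # beta_0 ... beta_23, no intercept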
Transforming the data by some arbitrary periodic function (sin/cos/etc.) is not really appropriate. For example, say you fit the model $$ y_t = \sum_{h=0}^{23} \beta_h \sin\Big(2\pi\frac{h(t)}{24}\Big) + \text{other covariates} + \epsilon_t, $$ where $h(t) = 12$ if observation $y_t$ is sampled at the 12th hour of the day (for example). Then you're imposing a peak at hour $h = 6$ (or wherever, by shifting the sine function) on the data, arbitrarily: $\sin(2\pi\frac{h(t)}{24})$ attains its maximum exactly at $h(t) = 6$.
Answered by Michael on November 2, 2021
Dummy encoding would destroy any proximity measure (and ordering) among hours. For example, the distance between 1 PM and 9 PM would be the same as the distance between 1 PM and 1 AM. It would be harder for the model to express something like "around 1 PM".
Even leaving them as is, e.g. as numbers 0-23, would be a better approach than dummy encoding in my opinion. But this approach has a catch as well: 00:01 and 23:59 would be seen as very distant, but actually they're not. To remedy this, your second listed approach, i.e. cyclic variables, is used. Cyclic variables map hours onto a circle (like a 24-hour mechanical clock) so that the ML algorithm can see the neighbours of individual hours.
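A minimal sketch of this cyclic mapping in R, checking that 00:01 and 23:59 end up close together (using minutes since midnight as the resolution is my own choice here):
m <- c(1, 23*60+59)                           # 00:01 and 23:59 in minutes since midnight
x <- sin(2*pi*m/1440); y <- cos(2*pi*m/1440)  # map onto the unit circle
sqrt(diff(x)^2 + diff(y)^2)                   # ~0.009: the two times are neighbours on the circle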
Answered by gunes on November 2, 2021