What is the best way to treat datetime in the preprocessing step of machine learning

Question

I have two datetime columns in my dataset. So far I have extracted the year, month, day of week and hour of day from these columns.
As you would expect, the result looks something like this:
2015-5-5 08:21:20   ----> 2015   5   5   8

So my question is: what is the best way to normalize these numbers? I think that the year, or one of the other numbers, will dominate my machine learning model.
I have not found any article on this. They all stop at the point where the datetime is converted to year, month and so on.
Thanks.

Noah Weber · Answer

You already have a good beginning: transforming the data into 4 columns (year, month, day and hour). These 4 are all categorical, so you can then just apply one-hot encoding. Then no domination will happen.
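As a rough illustration of the one-hot idea, here is a minimal sketch using only the Python standard library. The helper names (`one_hot`, `encode_timestamp`) and the list of years are made up for this example; in practice the year categories would come from the values seen in your training data, and a library encoder would normally do this for you.

```python
from datetime import datetime

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def encode_timestamp(ts, years):
    """One-hot encode year, month, day of week and hour of a datetime.

    `years` is the (assumed) list of years present in the dataset.
    """
    year = one_hot(ts.year, years)               # len(years) columns
    month = one_hot(ts.month, range(1, 13))      # 12 columns
    weekday = one_hot(ts.weekday(), range(7))    # 7 columns, Monday = 0
    hour = one_hot(ts.hour, range(24))           # 24 columns
    return year + month + weekday + hour

ts = datetime(2015, 5, 5, 8, 21, 20)
features = encode_timestamp(ts, years=[2014, 2015, 2016])
```

Every column is now 0 or 1, so no single feature has a larger scale than the others. The cost is width: 43 columns plus one per year for each datetime column.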

Answered by Noah Weber on December 22, 2020

honeybees · Answer

It depends on what the task is (what you are trying to predict) and how the date relates or could relate to that task. If you are trying to predict cancer risk and you have the datetime of someone's birth, the time and day portions are probably irrelevant (the month too, possibly). In this scenario it makes more sense to convert the datetime to the person's age.
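A minimal sketch of the age conversion, assuming you have a birth date and some reference date (for example the date of the measurement). The function name `age_in_years` is made up for this example.

```python
from datetime import date

def age_in_years(birth, on):
    """Completed years between `birth` and the reference date `on`."""
    # Has the birthday already occurred in the reference year?
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)

age_before = age_in_years(date(1980, 6, 15), date(2020, 6, 14))
age_on_birthday = age_in_years(date(1980, 6, 15), date(2020, 6, 15))
```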
In other scenarios you could consider binning, for example splitting the year into 4 seasons instead of 12 months * ~30 days, and/or splitting the day into morning/noon/evening/night instead of 24 hours.
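The binning could look roughly like the sketch below. The bin boundaries are assumptions chosen for illustration: seasons follow the northern-hemisphere meteorological convention (December to February is winter), and the day is cut into four equal 6-hour blocks. Pick boundaries that make sense for your own data.

```python
from datetime import datetime

def season(month):
    """Map a month (1-12) to a season (assumed northern hemisphere)."""
    # 12, 1, 2 -> winter; 3-5 -> spring; 6-8 -> summer; 9-11 -> autumn
    return ("winter", "spring", "summer", "autumn")[(month % 12) // 3]

def part_of_day(hour):
    """Map an hour (0-23) to one of four 6-hour blocks (assumed cut points)."""
    # 0-5 -> night; 6-11 -> morning; 12-17 -> noon; 18-23 -> evening
    return ("night", "morning", "noon", "evening")[hour // 6]

ts = datetime(2015, 5, 5, 8, 21, 20)
binned = (season(ts.month), part_of_day(ts.hour))
```

The resulting labels are still categorical, so they would need to be encoded (for example one-hot) before most models can use them, but with far fewer columns than months and hours.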
You could also convert the datetime to epoch time to obtain a single number / feature. You can look at some ways to do this in this answer.
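For illustration, a minimal sketch of the epoch conversion with the standard library. It assumes the timestamps are in UTC, which may not hold for your data. Since epoch seconds are large numbers, the sketch also rescales them to the range 0 to 1 with a simple min-max scaling; the helper names are made up for this example.

```python
from datetime import datetime, timezone

def to_epoch(ts):
    """Seconds since 1970-01-01 00:00:00 UTC, treating `ts` as UTC (assumption)."""
    return ts.replace(tzinfo=timezone.utc).timestamp()

def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical
    return [(v - lo) / (hi - lo) for v in values]

stamps = [
    datetime(2015, 5, 5, 8, 21, 20),
    datetime(2015, 5, 6, 8, 21, 20),
    datetime(2015, 5, 7, 8, 21, 20),
]
epochs = [to_epoch(ts) for ts in stamps]
scaled = min_max_scale(epochs)
```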
There are more possibilities, but it really depends on what you are trying to achieve. One-hot encoding everything is not a good option in most cases.
