Data Science Asked by Siddhant Tandon on January 1, 2021
I have a basic question about potential information leakage during train/test splitting. My dataset is a time series of multiple features for the year 2018, where each row is an observation taken every 3 seconds. The data goes into a function that filters it to a `start, end` date range and produces grouped aggregations with `df.groupby(col).agg(['mean', 'median', ...])`, where `col` is a categorical variable. The `start, end` date ranges are mutually exclusive, for example `2018-01-01:2018-01-15`, `2018-01-16:2018-01-30`, and so on, so the windows produced do not overlap at all.
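To make the setup concrete, here is a minimal sketch of the kind of function I mean. The synthetic data, the column names `feat_a` and `cat`, and the helper name `aggregate_window` are made up for illustration; my real function has more features and more aggregations.

```python
import numpy as np
import pandas as pd

def aggregate_window(df, start, end, col):
    """Keep rows between start and end (both inclusive), then aggregate per category."""
    window = df.loc[start:end]  # label slicing on a sorted DatetimeIndex
    return window.groupby(col).agg(["mean", "median"])

# Synthetic stand-in for the real data: one row every 3 seconds for 30 days.
idx = pd.date_range("2018-01-01", periods=30 * 28_800, freq=pd.Timedelta(seconds=3))
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "feat_a": rng.normal(size=len(idx)),
        "cat": rng.choice(["x", "y", "z"], size=len(idx)),
    },
    index=idx,
)

# Non-overlapping 15-day windows, as in the example above.
windows = [("2018-01-01", "2018-01-15"), ("2018-01-16", "2018-01-30")]
batches = [aggregate_window(df, start, end, "cat") for start, end in windows]
```

Each element of `batches` has one row per category and one (feature, statistic) column pair, and every raw row falls into exactly one window.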
I save these batches to disk, load all of them into a single dataframe, and then do a random train/test split. Am I leaking information into the test set this way?
Moreover, I want to scale my features before producing the aggregations, so a scaler has to be fitted at the very first step.
I came up with two approaches that I could follow to make the process less error-prone.
First:
1. Split the dataset by time order into train and test.
2. Fit the scaler on train and use it to transform both train and test.
3. Generate the aggregated data for both train and test.
4. Train on the aggregated train set and predict on the aggregated test set.
5. For a new test set, transform it using the fitted scaler, generate its aggregations, and predict on them.
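A rough sketch of steps 1 to 3 of this first approach, under some assumptions: I use scikit-learn's `StandardScaler` as the scaler, synthetic 1-minute data instead of the real 3-second data to keep it small, and made-up column names `feat_a` and `cat`. The model in step 4 is left out because it does not matter for the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 days of 1-minute observations (assumption, not the real data).
idx = pd.date_range("2018-01-01", periods=60 * 1440, freq=pd.Timedelta(minutes=1))
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {"feat_a": rng.normal(5.0, 2.0, len(idx)), "cat": rng.choice(["x", "y"], len(idx))},
    index=idx,
)
num_cols = ["feat_a"]

# 1. Split by time order: first 45 days for training, last 15 days for testing.
cutoff = pd.Timestamp("2018-02-15")
train = raw[raw.index < cutoff].copy()
test = raw[raw.index >= cutoff].copy()

# 2. Fit the scaler on train only, then transform both sets with it.
scaler = StandardScaler().fit(train[num_cols])
train[num_cols] = scaler.transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

# 3. Aggregate each set per 15-day window and category.
def aggregate(df):
    grouped = df.groupby([pd.Grouper(freq="15D"), "cat"])[num_cols]
    return grouped.agg(["mean", "median"])

train_agg = aggregate(train)
test_agg = aggregate(test)
```

Here the test rows never influence the scaler statistics, and every test window lies strictly after every training window.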
Second:
1. Take the whole dataset and, for every 15-day window, fit the scaler on that window and transform it.
2. Produce the aggregations on the 15-day batches.
3. Write the batches to disk, load them all into a dataframe, and do a random train/test split.
4. For a new test set, make 15-day windows, fit and apply the scaler on each test window, produce the aggregations, and predict on them.
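For comparison, a sketch of steps 1 and 2 of this second approach. Again the data and the column names `feat_a` and `cat` are made up, and I write the standardization out by hand (population standard deviation, the same formula `StandardScaler` uses) rather than calling the scaler.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 30 days of 1-minute observations (assumption, not the real data).
idx = pd.date_range("2018-01-01", periods=30 * 1440, freq=pd.Timedelta(minutes=1))
rng = np.random.default_rng(1)
raw = pd.DataFrame(
    {"feat_a": rng.normal(5.0, 2.0, len(idx)), "cat": rng.choice(["x", "y"], len(idx))},
    index=idx,
)

def scale_and_aggregate(window):
    """Standardize one window with its own mean and std, then aggregate per category."""
    scaled = window.copy()
    col = scaled["feat_a"]
    scaled["feat_a"] = (col - col.mean()) / col.std(ddof=0)
    return scaled.groupby("cat")["feat_a"].agg(["mean", "median"])

# One independently scaled batch per 15-day window, keyed by window start.
batches = {
    start: scale_and_aggregate(window)
    for start, window in raw.groupby(pd.Grouper(freq="15D"))
}
all_rows = pd.concat(batches)  # rows indexed by (window start, category)
```

One thing I notice when writing it this way: because each window is standardized on its own, every window ends up with overall mean 0, so any difference in level between windows is removed before the aggregation.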
Which of the two approaches should I try? If I go with the first approach, should I do it for multiple time folds, for example train on months 1-5 and test on months 6-7, then train on months 2-6 and test on months 7-8, and so on?
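The rolling folds I have in mind could be generated like this. It is only a sketch: the function name `rolling_month_folds` and its parameters are hypothetical, and months are plain integers 1 to 12.

```python
def rolling_month_folds(n_months=12, train_len=5, test_len=2, step=1):
    """Yield (train_months, test_months) pairs where test always follows train."""
    start = 1
    while start + train_len + test_len - 1 <= n_months:
        train = list(range(start, start + train_len))
        test = list(range(start + train_len, start + train_len + test_len))
        yield train, test
        start += step

folds = list(rolling_month_folds())
for train_months, test_months in folds:
    print("train:", train_months, "test:", test_months)
```

With the defaults this gives months 1-5 / 6-7 as the first fold and months 2-6 / 7-8 as the second. The scaler would be refitted on the training months of each fold only. As far as I know, scikit-learn's `TimeSeriesSplit` offers a similar ordered split on row indices.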