Data Science Asked by user3792245 on December 5, 2020
I am working with a multivariate time-series dataset and have put together Random Forest code (see below) to forecast the variable TM at a future time, training the model on two variables, FL and TM. I know that the two variables are closely correlated.
I was not sure I had the code right initially, but after training, when I evaluated it on the test data, I got an $R^2$ value of 0.98! This is my first attempt at multivariate forecasting, so I was pleasantly surprised, and I hope I did not make any mistakes along the way.
So, before I proceed, I was wondering if someone could help double-check my code below and point out anything obviously wrong that I might have overlooked. Thanks!
Here is a link to the data.
Also, any help with making this multi-step, so that I can forecast TM at any time $t+1, t+2, \ldots, t+n$, would be greatly appreciated!
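For the multi-step part, one common option is the direct strategy: fit a separate model for each horizon $h$, with the target shifted $h$ steps into the future, so each forecast uses only values observed up to time $t$. The sketch below illustrates this on synthetic data; `make_lagged`, `fit_direct_models`, the generated series, and the forest settings are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def make_lagged(frame, max_lag, min_lag=0):
    """Hypothetical helper: lags min_lag..max_lag of every column of `frame`."""
    parts = []
    for lag in range(min_lag, max_lag + 1):
        shifted = frame.shift(lag)
        shifted.columns = [f"{col}_{lag}" for col in frame.columns]
        parts.append(shifted)
    return pd.concat(parts, axis=1)


def fit_direct_models(frame, target, horizons, max_lag=2):
    """Fit one forest per horizon h, predicting `target` at t+h from values at t, t-1, ..."""
    features = make_lagged(frame, max_lag)       # row t holds values at t .. t-max_lag
    models = {}
    for h in horizons:
        y = frame[target].shift(-h)              # value h steps ahead of row t
        valid = features.notna().all(axis=1) & y.notna()
        model = RandomForestRegressor(n_estimators=20, random_state=0)
        model.fit(features[valid], y[valid])
        models[h] = model
    return models


# Synthetic stand-in for the real data (assumption: two correlated series)
rng = np.random.default_rng(0)
fl = np.cumsum(rng.normal(size=300))
tm = 0.8 * fl + rng.normal(scale=0.1, size=300)
demo = pd.DataFrame({"TM": tm, "FL": fl})

train = demo.iloc[:250]
models = fit_direct_models(train, "TM", horizons=[1, 2, 3])

# Forecast t+1, t+2, t+3 from the last row of the training window
latest = make_lagged(demo, 2).iloc[[249]]
forecasts = {h: float(m.predict(latest)[0]) for h, m in models.items()}
print(forecasts)
```

The alternative is a recursive strategy, where each one-step prediction is fed back in as a lag. With FL as an input, that would also require a forecast of FL at every step, which is why the direct variant is sketched here.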

    import seaborn as sns
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    import numpy as np
    import pandas as pd
    import datetime


    def table2lags(table, max_lag, min_lag=0, separator='_'):
        """ Given a dataframe, return a dataframe with different lags of all its columns """
        values = []
        for i in range(min_lag, max_lag + 1):
            values.append(table.shift(i).copy())
            values[-1].columns = [c + separator + str(i) for c in table.columns]
        return pd.concat(values, axis=1)


    data = pd.read_csv('Data_UW.csv')
    data['DATE_TIME'] = pd.to_datetime(data.Date + ' ' + data['Time'])

    # Lagged copies of TM (T_1 ... T_5); the NaNs this creates are filled with 0
    for obs in range(1, 6):
        data["T_" + str(obs)] = data.TM.shift(obs)
    data.fillna(0.00, inplace=True)

    # Chronological split; .copy() so a column can be added to test_data later
    training_data = data[data.DATE_TIME < pd.to_datetime('03/06/2019')].copy()
    test_data = data[data.DATE_TIME >= pd.to_datetime('03/06/2019')].copy()
    val_mask = (data.DATE_TIME >= pd.to_datetime('03/01/2019')) & (data.DATE_TIME < pd.to_datetime('03/02/2019'))
    val_data = data.loc[val_mask]

    clean_train = training_data[['DATE_TIME', 'TM', 'FL']]
    clean_test = test_data[['DATE_TIME', 'TM', 'FL']]
    clean_val = val_data[['DATE_TIME', 'TM', 'FL']]

    # Features: lags 0, 1 and 2 of TM and FL (TM_0, FL_0, TM_1, FL_1, TM_2, FL_2).
    # fillna(0) zeroes the leading rows that have no lagged value.
    X_train = table2lags(clean_train[['TM', 'FL']], 2).fillna(0)
    y_train = training_data.TM

    # criterion is left at its default (squared error; named 'mse' in older scikit-learn)
    rfr = RandomForestRegressor(n_estimators=200, max_depth=20, verbose=2, n_jobs=5)
    rfr.fit(X_train, y_train)

    X_test = table2lags(clean_test[['TM', 'FL']], 2).fillna(0)
    y_test = test_data.TM

    print(rfr.score(X_test, y_test))  # R^2 on the test set
    test_data["TM_Pred"] = rfr.predict(X_test)
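One thing in the code above that seems worth ruling out before trusting the score: `table2lags` is called with its default `min_lag=0`, so the feature table contains a `TM_0` column, which is the unshifted TM series, while the target is also TM at the same row. The sketch below reproduces this on synthetic data (an assumption, not the actual `Data_UW.csv`) and compares it with a past-only feature table built with `min_lag=1`.

```python
import numpy as np
import pandas as pd


def table2lags(table, max_lag, min_lag=0, separator='_'):
    """Same helper as in the question: lags min_lag..max_lag of every column."""
    values = []
    for i in range(min_lag, max_lag + 1):
        values.append(table.shift(i).copy())
        values[-1].columns = [c + separator + str(i) for c in table.columns]
    return pd.concat(values, axis=1)


# Synthetic stand-in for the real data (assumption: two random-walk series)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "TM": np.cumsum(rng.normal(size=100)),
    "FL": np.cumsum(rng.normal(size=100)),
})
y = demo["TM"]  # target, as in y_train = training_data.TM

X_with_lag0 = table2lags(demo[["TM", "FL"]], 2)             # default min_lag=0
X_past_only = table2lags(demo[["TM", "FL"]], 2, min_lag=1)  # lags 1 and 2 only

# With lag 0 included, one feature column is the target itself
leaks_target = bool((X_with_lag0["TM_0"] == y).all())
print(list(X_with_lag0.columns))
print("TM_0 identical to target:", leaks_target)
print(list(X_past_only.columns))
```

If this is what is happening with the real data, the model can largely copy `TM_0`, and the 0.98 would not reflect forecasting skill. Rebuilding the features with `min_lag=1` and comparing against a naive baseline that predicts the previous TM value would probably give a fairer picture.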