Data Science Asked by frantic oreo on August 16, 2021
I am trying to stop my random forest (RF) from overfitting. I am using time series data with a 1-day lag to predict the current price. I use this function to shift my independent features back 1 day:
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # Ref: https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
    # Builds lagged copies (t-n_in, ..., t-1) of every column; n_out is not used here.
    df = DataFrame(data)
    col_names = df.columns
    n_vars = df.shape[1]
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f'{col_names[j]}_t-{i}' for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    # rows at the start have no history to shift in, so they contain NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg
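To make the alignment concrete, here is a minimal sketch of what a one-day lag looks like on a toy frame. The column names and values are invented for illustration, and it calls `shift` directly rather than going through the function above.

```python
import pandas as pd

# Toy data; the column names and values are made up for illustration.
toy = pd.DataFrame(
    {"price": [10.0, 11.0, 12.0, 13.0], "volume": [100, 110, 120, 130]},
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)

# Shift every column down one row so each row holds the previous day's values.
# The first row has no previous day, so it becomes NaN and is dropped.
lagged = toy.shift(1).add_suffix("_t-1").dropna()

# Target: the current day's price, aligned to the rows that survived dropna().
target = toy["price"].loc[lagged.index]

print(lagged.join(target))
```

Each remaining row pairs yesterday's features with today's price, which is the layout the model is trained on.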
I then fit my RF. I have tried many different parameter settings, but the model still overfits.
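from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

n_lags = 1
lagged_X = series_to_supervised(df, n_in=n_lags, n_out=0, dropnan=True)
# drop the first n_lags targets so y lines up with the lagged rows
y = df[RESPONSE_VAR][n_lags:]
# shuffle=False keeps the split chronological: the test set is the last 20% of rows
X_train, X_test, y_train, y_test = train_test_split(
    lagged_X, y, test_size=0.2, random_state=42,
    shuffle=False)
rf = RandomForestRegressor(n_estimators=1, max_depth=1, max_features=1)
rf.fit(X_train, y_train)
preds_train = rf.predict(X_train)
preds_test = rf.predict(X_test)
mae_train = mean_absolute_error(y_train, preds_train)
mae_test = mean_absolute_error(y_test, preds_test)
print(f'mae_train: {mae_train}')
print(f'mae_test: {mae_test}')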
> mae_train: 683.4959502405592
> mae_test: 2491.3775235802696
My lagged_X.shape is (3179, 220). I would have assumed that constraining the model to only 1 tree and 1 feature would make it unable to overfit. I have inspected rf.feature_importances_ and tried dropping columns that seemed too informative (?), using more estimators and looser feature constraints, but this did not work either.
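For reference, because the split uses shuffle=False, the test set is the final 20% of the timeline, not a random sample. Below is a rough sketch of the row counts for 3179 rows. It assumes scikit-learn rounds a fractional test size up, and the helper name `chronological_split_sizes` is made up for illustration.

```python
import math

def chronological_split_sizes(n_samples, test_size=0.2):
    # Assumption: mirrors train_test_split with a float test_size and
    # shuffle=False, where the test count is rounded up and the training
    # set takes the remaining (earliest) rows.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = chronological_split_sizes(3179)
print(n_train, n_test)  # 2543 636
```

So the model is fit on the first 2543 rows and scored on the last 636, which come entirely from a later period.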
Can I get my train and test scores to converge?