Asked on March 24, 2021
My project involves predicting the sales quantity for a specific item across a whole year, and I'm using the LightGBM package to make the predictions. The params I've set for it are as follows:
params = {
    'nthread': 10,
    'max_depth': 5, #DONE
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',
    'metric': 'mape', # this is abs(a-e)/max(1,a)
    'num_leaves': 2, #DONE
    'learning_rate': 0.2180, #DONE
    'feature_fraction': 0.9, #DONE
    'bagging_fraction': 0.990, #DONE
    'bagging_freq': 1, #DONE
    'lambda_l1': 3.097758978478437, #DONE
    'lambda_l2': 2.9482537987198496, #DONE
    'verbose': 1,
    'min_child_weight': 6.996211413900573,
    'min_split_gain': 0.037310344962162616,
    'min_data_in_bin': 1, #DONE
    'min_data_in_leaf': 2, #DONE
    'num_boost_round': 1, #DONE
    'max_bin': 7, #DONE
    'extra_trees': True, #DONE
    'early_stopping_rounds': -1
}
My dataset consists of daily sales data (columns: date, quantity) for 2017, 2018, 2019, and the first 3 months of 2020. I've been using the 2017 and 2018 data for training and cross-validation, and testing on the 2019 data. However, my predictions for the year are way off the mark whether I aggregate the quantities on a weekly, monthly, quarterly, or yearly basis (error ~40-50%, and that is after tuning the params to bring the error down to these values). Moreover, the r2_score on the predictions is negative, around -2.91. Any suggestions on what can be done to make it better?
Script for lightgbm:
import lightgbm as lgb

# Wrap the train/validation splits in LightGBM Dataset objects
lgb_train = lgb.Dataset(train_x, train_y)
lgb_valid = lgb.Dataset(test_x, test_y, reference=lgb_train)

# Train, reporting the metric on both sets every 50 rounds
model = lgb.train(params, lgb_train,
                  valid_sets=[lgb_train, lgb_valid],
                  verbose_eval=50)
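Since the question mentions cross-validating on the 2017-2018 data but doesn't show how, here is a minimal sketch of a time-ordered split using scikit-learn's TimeSeriesSplit with lgb.cv. The train_x, train_y, and params names come from the snippets above; the number of splits and the boosting budget are assumptions.

from sklearn.model_selection import TimeSeriesSplit
import lightgbm as lgb

# Chronological folds: each fold trains on the past and validates on the
# immediate future, so no future sales leak into training.
# This assumes the rows of train_x are sorted by date.
tscv = TimeSeriesSplit(n_splits=5)

# Pass the boosting budget as a kwarg instead of inside params
cv_params = {k: v for k, v in params.items()
             if k not in ('num_boost_round', 'early_stopping_rounds')}

cv_results = lgb.cv(
    cv_params,
    lgb.Dataset(train_x, train_y),
    num_boost_round=500,              # assumed budget, not from the question
    folds=tscv.split(train_x),
)
# cv_results maps metric names to per-round means across folds;
# pick the round where the validation mape bottoms out.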
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

# Hold out all of 2019 for testing (.copy() avoids SettingWithCopyWarning)
test_df_pred = df[(df.date >= '2019-01-01') & (df.date < '2020-01-01')].copy()
#test_df_pred = df[(df.date >= '2019-01-01') & (df.date < '2019-02-01')]
#test_df_pred = df[(df.date >= '2019-01-15') & (df.date < '2019-01-22')]

# Rebuild the calendar features used at training time
test_df_pred['month'] = test_df_pred['date'].dt.month
test_df_pred['day'] = test_df_pred['date'].dt.dayofweek
test_df_pred['year'] = test_df_pred['date'].dt.year

# Predict with every column except the identifiers and the target
col = [i for i in test_df_pred.columns if i not in ['date', 'id', 'qty']]
y_test_pred = model.predict(test_df_pred[col])
test_df_pred['qty_pred'] = y_test_pred

# Point-wise error metrics on the daily predictions
mse = mean_squared_error(y_true=test_df_pred['qty'], y_pred=test_df_pred['qty_pred'])
mae = mean_absolute_error(y_true=test_df_pred['qty'], y_pred=test_df_pred['qty_pred'])
mape = mean_absolute_percentage_error(y_true=test_df_pred['qty'], y_pred=test_df_pred['qty_pred'])

# Total quantity over the test year, actual vs. predicted
qty = test_df_pred.qty.sum()
qty_pred = test_df_pred.qty_pred.sum()
diff = qty_pred - qty
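The question also reports the negative r2_score and errors at weekly/monthly/quarterly granularity, but neither computation is shown. Here is a minimal sketch of how they might be done with pandas resampling, assuming the test_df_pred frame from above with a datetime date column.

from sklearn.metrics import r2_score

# r2_score on the daily predictions; a negative value means the model
# does worse than simply predicting the mean of the actuals.
r2 = r2_score(test_df_pred['qty'], test_df_pred['qty_pred'])

# Aggregate to coarser totals and measure the percentage error there
agg = (test_df_pred.set_index('date')[['qty', 'qty_pred']]
       .resample('M')          # 'W' for weekly, 'Q' for quarterly
       .sum())
agg['pct_error'] = (agg['qty_pred'] - agg['qty']).abs() / agg['qty']
print(r2, agg['pct_error'].mean())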
I assume you are new to the field, so I would suggest working through some tutorials first, because your current approach is incorrect. It sounds like you want to model the sales as a time series, that is, predict future values from past values without using any external predictors. To do that, you need algorithms like ARIMA, exponential smoothing, etc. What you have done instead is try to correlate the year, month, and day with the sales; on their own, those features carry little information about the sales (and they are also encoded incorrectly), which is why your performance metric comes out negative. As a reference, check these, which address problems similar to yours: Source1, Source2, Source3. They should resolve your issue.
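As a concrete starting point for the ARIMA/exponential-smoothing route suggested above, here is a minimal sketch of a seasonal Holt-Winters baseline using statsmodels. The df, date, and qty names follow the question; the weekly resampling and the 52-week seasonal period are assumptions.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Daily sales -> weekly totals; smoother and gives a clear yearly seasonality
series = (df.set_index('date')['qty']
          .resample('W')
          .sum())

train = series[:'2018-12-31']              # fit on 2017-2018
test = series['2019-01-01':'2019-12-31']   # evaluate on 2019

# Additive trend and additive yearly (52-week) seasonality
model = ExponentialSmoothing(train, trend='add',
                             seasonal='add', seasonal_periods=52).fit()
forecast = model.forecast(len(test))

# Percentage error on the yearly total, mirroring the question's check
print(abs(forecast.sum() - test.sum()) / test.sum())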
Answered by Shahriyar Mammadli on March 24, 2021