Data Science Asked by Ruan Putka on August 13, 2021
What I am doing:
I am predicting product ratings using boosted trees (XGBoost) with a dataset in this format:
What I want to do:
I want to use SHAP TreeExplainer to interpret each prediction my model gives in terms of product attributes and user ids.
What I am getting:
My model is drawing all the conclusions based on product names and user ids, instead of product attributes and user ids.
What I tried:
I discovered that each product name has a unique combination of product attributes, i.e. by knowing the product attributes you can find its name. So my idea was to remove the product_name column, leaving only the attributes.
My reasoning was that restructuring the dataset in this way would lead to the interpretability that I wanted without any performance loss (since the product name doesn’t add any new information).
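The premise that the attributes fully determine the product name can be checked directly before dropping the column. Below is a minimal sketch using only the standard library; the column names (`product_name`, `color`, `size`, `user_id`) and the toy rows are hypothetical placeholders, not the actual dataset.

```python
from collections import defaultdict

def attribute_name_mapping(rows, name_key, attribute_keys):
    """Group rows both ways: names seen per attribute combo, combos seen per name."""
    names_per_combo = defaultdict(set)
    combos_per_name = defaultdict(set)
    for row in rows:
        combo = tuple(row[key] for key in attribute_keys)
        names_per_combo[combo].add(row[name_key])
        combos_per_name[row[name_key]].add(combo)
    return names_per_combo, combos_per_name

def is_one_to_one(rows, name_key, attribute_keys):
    """True if every attribute combo maps to exactly one name and vice versa."""
    names_per_combo, combos_per_name = attribute_name_mapping(
        rows, name_key, attribute_keys
    )
    return (
        all(len(names) == 1 for names in names_per_combo.values())
        and all(len(combos) == 1 for combos in combos_per_name.values())
    )

# Toy data with hypothetical columns, for illustration only.
rows = [
    {"product_name": "A", "color": "red", "size": "S", "user_id": 1},
    {"product_name": "A", "color": "red", "size": "S", "user_id": 2},
    {"product_name": "B", "color": "red", "size": "M", "user_id": 1},
    {"product_name": "C", "color": "blue", "size": "M", "user_id": 3},
]

print(is_one_to_one(rows, "product_name", ["color", "size"]))
```

If this returns False on the real data, some attribute combinations are shared by several products, and the name column would then carry information the attributes do not.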
What I got:
The model performance decreased a lot. Even with a great deal of hyperparameter tuning, I couldn’t get near the performance I had when also using the product name.
What I think may be going on:
or
I am a little skeptical about number 2, since my training loss also went up when I removed the product name.
My question:
So, how can I restructure my dataset? Does anybody have a clue why my model can’t reach the same performance without using the product name? Any light or ideas on what I can try?
One possibility is that your attribute features are weak, noisy predictors, so XGBoost cannot build meaningful decision trees from them alone.
When you add the name as a predictor, XGBoost finds some signal with respect to your target variable (rating), and you get a better score. That may be why the name-plus-attributes model outperforms the attributes-only model.
If you know from domain experience that product attributes are only weakly related to rating, you can conclude that this feature set will not help you make accurate predictions. Alternatively, instead of relying on domain expertise, you can use correlation or other relevant statistical tests to examine how the attributes relate to rating; if the relationship turns out to be nonexistent or very weak, you can conclude that a good model is not possible with these features alone.
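As one possible way to run such a check on a categorical attribute, the sketch below computes the correlation ratio (eta squared), i.e. the share of rating variance explained by the attribute's group means. The function name, the attribute values, and the ratings are all made up for illustration; this is only one of several tests that could be used, and a univariate measure like this will not detect interactions between attributes.

```python
from collections import defaultdict

def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    if len(categories) != len(values) or not values:
        raise ValueError("categories and values must be non-empty and equal length")

    # Collect the target values observed for each category level.
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    grand_mean = sum(values) / len(values)
    ss_total = sum((value - grand_mean) ** 2 for value in values)
    if ss_total == 0:
        return 0.0  # constant target: nothing to explain

    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return ss_between / ss_total

# Hypothetical toy data: ratings alongside two made-up attributes.
ratings = [5, 5, 1, 1]
color = ["red", "red", "blue", "blue"]  # perfectly separates the ratings
size = ["S", "M", "S", "M"]             # unrelated to the ratings

print(correlation_ratio(color, ratings))  # 1.0
print(correlation_ratio(size, ratings))   # 0.0
```

Values near 0 suggest the attribute on its own says little about rating. Note that eta squared tends to be inflated for columns with many levels and few rows per level (such as a product name or user id), so it is best compared across attributes of similar cardinality.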
So, if possible, add more relevant features if you want to build a reasonably good model.
Regards Vik
Answered by Vikrant Arora on August 13, 2021