Asked by Sumit Verma on June 24, 2021
Consider the following scenario.
I have trained a K-Means model on some input features, say, (A, B, C, D and E).
Now, at prediction time, I want the model to make predictions using only a subset of the features, e.g. (A, D, E), as opposed to the full set it was initially trained on, i.e. (A, B, C, D and E).
With the above in mind, I have some questions, which are as follows:
- Is this approach a correct approach, or logical with respect to machine learning principles?
- Will it affect the model accuracy, and if yes, then how?
- If I have to provide features B and C, then can I populate them with zeros and provide them to the trained model for making predictions?
- Will the action taken in step 3) affect the model accuracy? If yes, then why and how?
Interesting question. The answer is: It depends.
The best way to find out how it would affect your model is with the shap package. You can use it to uncover the importance of features and reveal interaction effects in the model.
There could be a very different effect depending on how "important" the excluded features are.
Let's assume a very simple decision tree model, where your most important features and rules are represented by the top-n splits (and so on). If you wanted to make your model more generic (i.e. prevent overfitting), you would also prune the tree (cut the less important features/rules). The pruning wouldn't noticeably affect your model's performance (no dramatic loss of accuracy). In contrast, if you excluded one of the top features (or just provided a static value for it), it would have a negative impact on your model's predictions.
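For illustration, here is a minimal sketch of that shap workflow; the dataset and the tree model below are stand-ins for whatever model you actually have:

```python
# Minimal sketch: rank features by SHAP importance on a toy tree model.
# X, y and the model are placeholders, not the questioner's data.
import shap
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = DecisionTreeRegressor(max_depth=4).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Features whose SHAP values are near zero everywhere are the safest
# candidates for exclusion; the top-ranked ones are the dangerous ones.
shap.summary_plot(shap_values, X, feature_names=list("ABCDE"))
```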
Answered by Predicted Life on June 24, 2021
- Is this approach a correct approach, or logical with respect to machine learning principles?
It will affect the performance of the model, in the sense that your algorithm learned to separate the clusters based on distances computed over all the features. There are discussions about how to calculate feature importance on unsupervised problems like yours, so you could do some research on that and figure out a way to measure the importance of your features. That would give you an idea of how influential each feature is on your model, and therefore of the impact of removing one. In this case, removing a feature implies filling the features you are not using with NaNs, so your model must be prepared for such a scenario (sklearn pipelines are the best way of doing this; see the sketch below).
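As a rough sketch of that pipeline idea (the data and cluster count below are made up), an imputer in front of k-means lets you pass NaNs for the missing features at prediction time:

```python
# Sketch: a pipeline that imputes missing features before clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fills NaNs with column means
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

X_train = np.random.rand(100, 5)  # columns stand for A, B, C, D, E
pipe.fit(X_train)

# At prediction time, B and C (columns 1 and 2) are unavailable:
X_new = np.random.rand(10, 5)
X_new[:, [1, 2]] = np.nan
labels = pipe.predict(X_new)  # the imputer fills B and C before k-means
```

Mean imputation is only one choice; the point is that the pipeline handles the missing columns in one consistent, fitted step rather than with ad hoc zeros.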
- Will it affect the model accuracy and if yes then how?
First of all, you are referring to an unsupervised model (k-means), so the accuracy metrics you mention do not apply; instead, there are metrics for the separation of the clusters you formed (silhouette score, etc.). Following my first answer, you could adapt a version of permutation importance, with a metric suited to your problem, to see how removing a feature impacts the overall performance, as sketched below.
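A minimal sketch of such a permutation-style check, on synthetic data and with silhouette as the metric:

```python
# Sketch: permutation importance adapted to clustering. Shuffle one
# feature at a time and measure how much the silhouette score drops.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
baseline = silhouette_score(X, km.labels_)

rng = np.random.default_rng(0)
for j, name in enumerate("ABCDE"):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])  # destroy the information in one feature
    score = silhouette_score(X_perm, km.predict(X_perm))
    print(f"{name}: silhouette drop = {baseline - score:.3f}")
```

A large drop means the clustering leans heavily on that feature, so removing it at prediction time would hurt the most.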
- If I have to provide features B and C, then can I populate them with zeros and provide them to the trained model for making predictions?
Remember you are using an algorithm that is based on Euclidean distance, so imputing with zeros may have an undesired result.
- Will the action taken in step 3) affect the model accuracy? If yes, then why and how?
Sure it will: imputing with zeros moves the missing features to the origin of the Euclidean space, so be careful with that.
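A tiny numeric illustration (made-up values): zeroing B and C drags a point toward the origin and distorts its distance to a centroid that expects non-zero values there:

```python
# Toy example: how zero-filling two features changes Euclidean distance.
import numpy as np

point = np.array([0.9, 0.8, 0.7, 0.6, 0.5])     # full features A..E
centroid = np.array([1.0, 1.0, 1.0, 0.5, 0.5])  # a trained cluster center

full_dist = np.linalg.norm(point - centroid)    # ~0.39

zeroed = point.copy()
zeroed[[1, 2]] = 0.0                            # B and C set to zero
zeroed_dist = np.linalg.norm(zeroed - centroid) # ~1.42

print(full_dist, zeroed_dist)  # the zeroed point may land in another cluster
```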
Answered by Julio Jesus on June 24, 2021
The other answers make sense but I would be more categorically negative about the idea:
- Is this approach a correct approach, or logical with respect to machine learning principles?
No, it's not. The parameters of a ML model (whether supervised or unsupervised) are estimated using a particular set of features designed as the input for the problem. Changing the input (features) changes the definition of the problem as well, so the solution (model) obtained for the first problem is unlikely to work as well on the new problem.
- Will it affect the model accuracy, and if yes, then how?
It's very likely to decrease the performance of the model.
Normally the features used in the model are chosen because they are likely to "help" the model. If they are "helpful" then the model will rely on them, and therefore removing them will cause the model to fail.
- If I have to provide features B and C, then can I populate them with zeros and provide them to the trained model for making predictions?
You sure can, but it's a bad idea.
- Will the action taken in step 3) affect the model accuracy? If yes, then why and how?
Same as point 2: the performance is very likely to drop. Replacing valuable signals for the model with arbitrary values is the equivalent of randomly switching blood samples in a biology lab: it leads to wrong tests and wrong results.
Another way to look at it: if what you propose were possible, it would mean that it's possible in general to remove one feature and obtain the same performance. So let's say we have performance P with features (A, B, C, D, E), and when we remove A we still have performance P. Then by our assumption we can also remove B and still obtain performance P, and so on until we obtain a model with 0 features which still has performance P. This is a contradiction, so the hypothesis that it's possible to remove a feature without losing performance is false.
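For what it's worth, here is a toy check of that argument on synthetic data: zeroing out one feature after another makes the cluster assignments agree less and less with the original ones (measured by adjusted Rand index):

```python
# Toy check: remove (zero out) features one at a time and track how
# well the original clustering is recovered.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

X_masked = X.copy()
for j in range(5):
    X_masked[:, j] = 0.0  # one more feature "removed"
    agreement = adjusted_rand_score(km.labels_, km.predict(X_masked))
    print(f"{j + 1} feature(s) zeroed: ARI = {agreement:.3f}")
```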
Answered by Erwan on June 24, 2021