Cross Validated

Asked by uared1776 on July 31, 2020
What is the general procedure for a combined task of model tuning (i.e., hyperparameter selection), feature selection and model selection?
I know the basic principles of each task individually, but I am confused about how to combine them.
For example, let's assume we have 1000 features to select from (already filtered by unsupervised methods) and 1000 samples, and the output is a binary True/False label.
The candidate models considered here are a k-nearest neighbors (KNN) model and a support vector machine (SVM) with a linear kernel, so the hyperparameters are the number of neighbors (k) in KNN and the cost (C) in SVM.
We would like to use a genetic algorithm (GA) to guide the feature search, and cross-validation to tune the hyperparameters during the search process.
Is the following procedure correct?
The following steps 1 to 6 are run for each of the candidate models (i.e., KNN and SVM) individually:
Build a KNN model and an SVM model using steps 1 to 6, and select the best model using the statistics obtained in step 6.
The above procedure looks very complex, and the result is uncertain because the features finally selected by the GA can differ between runs. Is there another way to carry out these combined tasks (feature selection, model tuning and model comparison)?
I did not use recursive feature elimination methods (e.g., backwards selection) because some of the predictors are correlated, and removing correlated predictors can cause a loss of information in this particular problem. So I think a GA search is more flexible, and thus better suited to correlated predictors. Is that right?
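For concreteness, here is a rough sketch of the kind of CV-based fitness function I have in mind for the GA, using scikit-learn. The data, the parameter grid and the feature mask are placeholders, and the GA loop itself (selection, crossover, mutation) is omitted; this is only meant to show one candidate feature subset being scored with an inner hyperparameter search and an outer cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

def fitness(feature_mask, X, y):
    """Fitness of one GA individual: cross-validated accuracy of a linear SVM
    whose cost C is tuned by an inner GridSearchCV, using only the masked features."""
    tuner = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]},
                         cv=3, scoring="accuracy")
    return cross_val_score(tuner, X[:, feature_mask], y,
                           cv=5, scoring="accuracy").mean()

# Tiny demonstration on synthetic data with one random candidate feature subset.
X, y = make_classification(n_samples=1000, n_features=1000,
                           n_informative=20, random_state=0)
rng = np.random.default_rng(0)
mask = rng.random(X.shape[1]) < 0.05   # roughly 50 of the 1000 features switched on
print(fitness(mask, X, y))
```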
This is a very good question, and it is a shame that it has not been answered on this site. More content like this should be promoted in the community to minimise the number of junk articles that encourage poor practice.
Regarding your question (if you're still there!), the fact that some features are correlated is precisely why removing them is worthwhile. If two features are so highly correlated that one can be used to infer the other, then the second is unnecessary! You can drop that feature, and if you knew those features were correlated prior to seeing the data you could take them out yourself. Otherwise, use recursive feature elimination (RFE).
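If it helps, here is a minimal sketch of RFE with cross-validation in scikit-learn (RFECV). The synthetic data and the settings (step size, number of folds) are just placeholders matching the shape described in your question, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Stand-in data with the shape from the question: 1000 samples, 1000 features.
X, y = make_classification(n_samples=1000, n_features=1000,
                           n_informative=20, random_state=0)

# Recursive feature elimination with CV; the linear SVM supplies coef_ for ranking.
selector = RFECV(
    estimator=SVC(kernel="linear"),
    step=50,                    # drop 50 features per elimination round
    cv=StratifiedKFold(5),
    scoring="accuracy",
    n_jobs=-1,
)
selector.fit(X, y)
print("Number of features kept:", selector.n_features_)
```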
Otherwise I am not so sure... As I understand it, one picks a particular model with some parameter settings and then uses cross-validation on that specific model. I think what you are doing is having an outer loop of cross-validation and an inner loop that chooses the best parameters for the model, so you should switch the two. This is because you are trying to evaluate a given model; if you search for the optimal parameters within each fold, you are not fixing a model and evaluating it, but rather finding the best model for a given fold of your dataset. You also do not need a genetic algorithm to pick the model parameters at the beginning. If you're using Python/scikit-learn, then GridSearchCV is a good solution.
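For example, something along these lines; again only a sketch, with placeholder data, pipelines and parameter grids rather than values I would actually recommend:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data matching the question's shape (1000 samples, 1000 features).
X, y = make_classification(n_samples=1000, n_features=1000,
                           n_informative=20, random_state=0)

# One pipeline and parameter grid per candidate model: k for KNN, C for the linear SVM.
candidates = {
    "knn": (Pipeline([("scale", StandardScaler()), ("clf", KNeighborsClassifier())]),
            {"clf__n_neighbors": [3, 5, 11, 21]}),
    "svm": (Pipeline([("scale", StandardScaler()), ("clf", SVC(kernel="linear"))]),
            {"clf__C": [0.01, 0.1, 1, 10]}),
}

for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Bear in mind that best_score_ here is the score used to pick the parameters, so if you also use it to choose between KNN and SVM you should confirm the winner on data that was not part of the search.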
Hope this makes sense - please note I'm no genius or expert so do take my answer with a grain of salt. Hope this helps all the same!
Answered by Daniel Soutar on July 31, 2020