Data Science Asked by mkoeck on August 3, 2021
For one of my projects I needed to create classification models for each of many products. To see which classifier performs best, I trained one SVM, one RandomForest and one Naive Bayes model for each product, using 5-fold cross-validation. I saved the model and the score for each fold (so 15 models and scores per product).
Now I want to understand why some models perform much better than others. For this I plan to fit a linear regression, where the y value is the model score (e.g. precision, recall, …) and the x values are one-hot encoded categorical variables. For illustrative purposes, let's say there are two categorical variables: color = ["red", "green", "yellow"], age_group = ["children", "teenager", "adults"]. Basically, I want to see whether the score differs significantly depending on the color and/or the age_group.
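A minimal sketch of the regression described above, assuming NumPy is available. The scores are made up and noise-free, and the effect sizes in `color_effect` and `age_effect` are hypothetical values chosen only so the fit can be checked by eye. The first level of each variable is dropped as the reference category, because keeping all one-hot columns alongside an intercept makes the design matrix collinear.

```python
import itertools
import numpy as np

colors = ["red", "green", "yellow"]
age_groups = ["children", "teenager", "adults"]


def one_hot_drop_first(values, levels):
    """Dummy-code `values`, dropping the first level as the reference category."""
    return np.array(
        [[1.0 if v == lvl else 0.0 for lvl in levels[1:]] for v in values]
    )


# Hypothetical, noise-free effects (NOT real results): baseline score 0.70.
color_effect = {"red": 0.0, "green": 0.05, "yellow": -0.10}
age_effect = {"children": 0.0, "teenager": 0.02, "adults": 0.08}

# One made-up sample per (color, age_group) combination.
rows = list(itertools.product(colors, age_groups))
y = np.array([0.70 + color_effect[c] + age_effect[a] for c, a in rows])

# Design matrix: intercept column plus the dummy columns of both variables.
X = np.hstack(
    [
        np.ones((len(rows), 1)),
        one_hot_drop_first([c for c, _ in rows], colors),
        one_hot_drop_first([a for _, a in rows], age_groups),
    ]
)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
names = [
    "intercept",
    "color=green",
    "color=yellow",
    "age_group=teenager",
    "age_group=adults",
]
fitted = dict(zip(names, coef))

for name, value in fitted.items():
    print(f"{name}: {value:+.3f}")
```

Each coefficient is the difference in score relative to the reference level ("red" and "children" here). This sketch only estimates the coefficients; judging significance additionally needs standard errors and p-values, which a statistics package such as statsmodels reports for an OLS fit.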
From this I encountered the following questions:
What should count as one sample in my linear regression? Does it make sense to treat each fold as its own sample? I expect that this is not correct, since the samples would then not be independent. Instead, my approach would be to take the best-performing estimator of the three and use its mean score across the folds as the sample, so that I would have only one sample per product. Alternatively, I think it might also be fine to introduce a third categorical variable that identifies the estimator. Then I could use the mean score of the 5 folds for each estimator, giving 3 samples per product (again, I am not sure whether this would violate the independence of the samples).
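The two ways of building samples described above can be sketched as follows. The product names, the layout of `fold_scores`, and every score in it are hypothetical placeholders, not results from the actual project.

```python
from statistics import mean

# Hypothetical fold scores: fold_scores[product][estimator] -> 5 fold scores.
fold_scores = {
    "product_a": {
        "svm": [0.80, 0.82, 0.78, 0.81, 0.79],
        "random_forest": [0.85, 0.86, 0.84, 0.87, 0.83],
        "naive_bayes": [0.70, 0.72, 0.71, 0.69, 0.73],
    },
    "product_b": {
        "svm": [0.60, 0.62, 0.61, 0.59, 0.63],
        "random_forest": [0.58, 0.57, 0.59, 0.60, 0.56],
        "naive_bayes": [0.65, 0.66, 0.64, 0.67, 0.63],
    },
}

# Option 1: one sample per product, using the best estimator's mean score.
best_per_product = {}
for product, by_estimator in fold_scores.items():
    means = {est: mean(scores) for est, scores in by_estimator.items()}
    best = max(means, key=means.get)
    best_per_product[product] = (best, means[best])

# Option 2: three samples per product, one mean score per estimator,
# with the estimator kept as an extra categorical variable.
per_estimator_rows = [
    {"product": product, "estimator": est, "mean_score": mean(scores)}
    for product, by_estimator in fold_scores.items()
    for est, scores in by_estimator.items()
]

print(best_per_product)
print(len(per_estimator_rows), "rows for option 2")
```

In option 2, the three rows of a product are computed from the same underlying data, which is exactly the independence concern raised above.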
Thank you very much for your help!