Cross Validated Asked by alajeb on January 21, 2021
I have a dataset with 23 features that model the web traffic of two protocols. Some features are extracted directly and the others are statistically computed. I want to build a classification model that predicts which protocol an observation belongs to.
I started by visualizing box plots of my variables and got the following results:
What interpretations can I draw from these plots? Should I eliminate these outliers?
Before you start selecting features manually, I would first check the leave-one-out error rate of a simple classifier like kNN (R function knn from library class) or a normal distribution model (aka "quadratic discriminant analysis", R functions qda and predict.qda from library MASS). If this does not yield decent results, you can try to eliminate features, e.g. by greedy backward selection.
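To make the two ideas concrete, here is a minimal sketch of a leave-one-out error estimate for kNN and a greedy backward selection built on top of it. It is plain Python with a hand-written kNN, not the R functions named above, and the toy data, the function names and the stopping rule (stop once every possible removal makes the error worse) are assumptions for illustration only.

```python
import math
import random
from collections import Counter

def loo_knn_error(X, y, k=3, features=None):
    """Leave-one-out error rate of a k-nearest-neighbour classifier."""
    features = list(range(len(X[0]))) if features is None else list(features)
    errors = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from the held-out point to every other point,
        # using only the selected feature columns.
        dists = [
            (math.dist([xi[f] for f in features], [xj[f] for f in features]), y[j])
            for j, xj in enumerate(X) if j != i
        ]
        dists.sort(key=lambda d: d[0])
        votes = Counter(label for _, label in dists[:k])
        predicted = votes.most_common(1)[0][0]
        errors += predicted != yi
    return errors / len(X)

def greedy_backward_selection(X, y, k=3):
    """Repeatedly drop the feature whose removal gives the lowest LOO error.

    Assumed stopping rule: stop once every possible removal increases the error.
    """
    selected = list(range(len(X[0])))
    best = loo_knn_error(X, y, k, selected)
    while len(selected) > 1:
        candidates = [
            (loo_knn_error(X, y, k, [f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        err, drop = min(candidates)
        if err > best:
            break
        best = err
        selected.remove(drop)
    return selected, best

# Hypothetical toy data: feature 0 separates the two classes,
# features 1 and 2 are pure noise.
random.seed(0)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        X.append([random.gauss(3.0 * label, 1.0),
                  random.gauss(0.0, 1.0),
                  random.gauss(0.0, 1.0)])
        y.append(label)

print("LOO error, all features:", loo_knn_error(X, y, k=3))
print("after backward selection:", greedy_backward_selection(X, y, k=3))
```

If the error with all features is already low, there is little reason to drop anything. Note that reusing the same leave-one-out error for both selecting features and reporting performance tends to give an optimistic estimate.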
Note that your features all seem to cover the same numeric range, but if this is not the case, you might consider standardizing them beforehand to a fixed range (or, alternatively, to variance one).
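The two standardization options can be sketched as follows, applied to one feature column at a time. This is a minimal illustration in plain Python; the function names are made up for the example, the sample standard deviation is an assumed choice, and a constant column would cause a division by zero and needs separate handling.

```python
import statistics

def scale_to_range(column, lo=0.0, hi=1.0):
    """Min-max scaling: map the column linearly onto [lo, hi]."""
    cmin, cmax = min(column), max(column)
    return [lo + (v - cmin) * (hi - lo) / (cmax - cmin) for v in column]

def scale_to_unit_variance(column):
    """Centre the column at zero and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

# Hypothetical feature column.
column = [2.0, 4.0, 6.0]
print(scale_to_range(column))
print(scale_to_unit_variance(column))
```

This matters for distance-based classifiers like kNN, because a feature with a much larger numeric range would otherwise dominate the distance.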
Answered by cdalitz on January 21, 2021