Data Science, asked on March 29, 2021
My friend was reading a textbook and had this question:
Suppose that you observe $(X_1,Y_1),\dots,(X_{100},Y_{100})$, which you assume to be i.i.d. copies of a random pair $(X,Y)$ taking values in $\mathbb{R}^2 \times \{1,2\}$. You plot the data and see the following:

[Figure: scatter plot of the $X_i$, where black circles mark the points with $Y_i=1$ and red triangles mark the points with $Y_i=2$.]

A practitioner tells you that their misclassification costs are equal, $c_1 = c_2 = 1$, and would like advice on which algorithm to use for prediction. Given the options:
Which would be the best algorithm for this? I think it should be option $5$, since the larger $K$ gets, the worse the accuracy becomes. What would be your choice, and why?
You can choose the optimal method using cross-validation; since your sample size is relatively small, use leave-one-out cross-validation. I would not be surprised if $K = 5$ worked well. Linear discriminant analysis (LDA) will not work here because it implies linear decision boundaries, unless you enlarge the set of predictors with non-linear transformations.
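For concreteness, here is a minimal sketch of that leave-one-out procedure in R, assuming the predictors are stored in a 100 × 2 numeric matrix `X` and the labels in a factor `Y` (placeholder names, not from the question); `knn.cv()` from the `class` package computes the leave-one-out prediction for each observation directly:

```r
# A minimal sketch, assuming placeholder objects `X` (100 x 2 numeric matrix)
# and `Y` (factor of labels). knn.cv() returns, for each observation, the
# K-NN prediction computed with that observation left out.
library(class)

ks <- 1:25
loocv_error <- sapply(ks, function(k) {
  pred <- knn.cv(train = X, cl = Y, k = k)  # LOO prediction for every point
  mean(pred != Y)                           # misclassification rate at this K
})

best_k <- ks[which.min(loocv_error)]
best_k
```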
Also, the picture above is a classic case where support vector machines (SVM) with a Gaussian kernel could be of use. R has a friendly implementation of SVM in the "kernlab" package.
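As a rough sketch of that suggestion (again using the placeholder `X` and `Y` from above), kernlab's `ksvm()` fits a soft-margin SVM with a Gaussian kernel; using a single cost parameter `C` is consistent with the equal misclassification costs $c_1 = c_2 = 1$:

```r
# A hedged sketch of an RBF-kernel SVM via kernlab's ksvm(); `X` and `Y`
# are the same placeholder objects as in the K-NN sketch above.
library(kernlab)

fit <- ksvm(x = X, y = Y,
            type   = "C-svc",     # standard soft-margin classification
            kernel = "rbfdot",    # Gaussian (RBF) kernel
            kpar   = "automatic", # sigma picked by kernlab's sigest() heuristic
            C      = 1,           # one symmetric cost, matching c1 = c2 = 1
            cross  = 5)           # also report a 5-fold CV error estimate

fit                     # prints training error and cross-validation error
predict(fit, X[1:5, ])  # predicted labels for the first five points
```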
Correct answer by stans on March 29, 2021