Data Science Asked on February 19, 2021
I face a classification task: a target feature is to be predicted from several other features. I'm working with Python.
My dataset includes 60 features, from which I picked 16 that I think could be relevant (many of the others are timestamps, for example). The problem is that most of these 16 features are categorical: encoding them with `get_dummies` generates 886 features.
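For reference, the encoding step looks roughly like this (the column names here are invented, just to illustrate the dummy blow-up):

```python
import pandas as pd

# Hypothetical column names, only to illustrate the encoding step.
categorical_cols = ["device_type", "country", "product_category"]
numeric_cols = ["duration"]

df = pd.read_csv("data.csv")
X = pd.get_dummies(df[categorical_cols + numeric_cols], columns=categorical_cols)
print(X.shape)  # the dummy expansion is what turns 16 features into 886 columns
```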
The data also includes about 17 million observations.
I am now wondering how to tackle this problem and what to research and try next. I summarized this in the two questions below and I'd love to hear some opinions!
First. If possible, I'd like to reduce the number of features. I tried using `SelectFromModel` with a `RandomForestClassifier`, which worked okay, I think: the number of features was reduced drastically without much loss of prediction power. However, as my categorical features are split into several dummy features, only some of the dummies derived from an original feature get selected (it's not either all or none of the features that originated from one). Is this a problem? If so, can it be avoided?
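For context, my selection step looks roughly like this (a sketch; `X` and `y` are assumed to be the encoded features and target, and the grouping back to original features assumes the original feature names contain no underscore):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X: the dummy-encoded feature DataFrame, y: the target (both assumed to exist)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
    threshold="median",  # keep columns whose importance is above the median
)
selector.fit(X, y)

selected = X.columns[selector.get_support()]
# get_dummies names columns "<feature>_<value>", so the prefix recovers the
# original feature (assuming the original names contain no underscore)
originals = sorted({col.split("_")[0] for col in selected})
print(f"{len(selected)} dummy columns kept, covering {len(originals)} original features")
```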
Second. As one model can be tuned a lot by playing with parameters or input representation, I would like to focus on a few promising models. For faster results, I used only 200,000 observations to train a `KNeighborsClassifier`, `LogisticRegression`, `LinearSVC`, `DecisionTreeClassifier`, `RandomForestClassifier`, `GradientBoostingClassifier` and `MLPClassifier` (neural network), all from `sklearn`.
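Roughly, the comparison loop looked like this (a sketch; `X_small`/`y_small` are the 200,000-row subsample, and the hyperparameters are just sklearn defaults):

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# X_small, y_small: the 200,000-row subsample (assumed to be prepared already)
X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.2, stratify=y_small, random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
    "mlp": MLPClassifier(random_state=0),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy={acc:.3f}, fit time={time.time() - start:.1f}s")
```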
I would not continue to pursue `KNeighborsClassifier` and `GradientBoostingClassifier`, as they already took a lot of time to train on this small subset.
As `RandomForestClassifier` is supposed to perform better than `DecisionTreeClassifier` nearly always, I would drop the latter too and stick with `RandomForestClassifier`.
The two linear models, `LogisticRegression` and `LinearSVC`, worked well with `penalty='l1'` but very badly with `penalty='l2'`, so I would continue to pursue both with `penalty='l1'` (although I expect similar results from the two).
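Concretely, the two l1 configurations were something like this (a sketch; a solver such as `liblinear` or `saga` is needed for l1 with `LogisticRegression`, and `LinearSVC` needs `dual=False` when `penalty='l1'`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

logreg_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
svc_l1 = LinearSVC(penalty="l1", dual=False, C=1.0)  # squared hinge loss, primal problem
```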
My first neural network performed very badly, but I guess there are plenty of things to try to improve it; however, I expect a very long training time on the full dataset (as this small subset already took quite a while).
I did not try a naive Bayes classifier (as it is supposed to perform worse than linear models anyway) or a support vector machine with a non-linear kernel (as they are supposed to scale badly to many observations).
Summary: I would continue my work by looking at `RandomForestClassifier`, `LogisticRegression`/`LinearSVC` and (if you think this is a good idea) neural networks. Is this reasonable?
Thanks a lot!
I don't really have experience with such a massive dataset, but my first thought would be to explore the instances in order to see whether so many are needed. I would start with an ablation experiment, trying various sizes of training data with a simple method (a random forest seems a good idea) in order to observe how the performance evolves with the size of the training data. It's likely that the performance reaches a plateau at some point, and it would be useful to know where that point is.
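For example, scikit-learn's `learning_curve` gives you this kind of ablation almost for free (a sketch; with 17 million rows you would probably run it on a sample first):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X, y: features and target (assumed to be available already)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, n_jobs=-1),
    X, y,
    train_sizes=np.linspace(0.01, 1.0, 10),  # 1% .. 100% of the training fold
    cv=3,
    n_jobs=-1,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>10} instances -> mean CV score {score:.3f}")
```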
It might also make sense to check whether the data contains duplicates or near-duplicates. You can't simply remove duplicates, because the distribution matters, but it might be possible to replace them by assigning weights to instances. I'm not an expert in this, but there are methods for instance selection.
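For exact duplicates, one simple option is to collapse them and use the counts as sample weights (a rough sketch, assuming the target lives in a column named `target` and the model accepts `sample_weight`):

```python
from sklearn.ensemble import RandomForestClassifier

# df: the encoded feature DataFrame, including the target column "target" (name assumed)
# Collapse exact duplicate rows into one row plus a count used as the sample weight.
grouped = df.groupby(list(df.columns), as_index=False).size()

X_dedup = grouped.drop(columns=["size", "target"])
y_dedup = grouped["target"]

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_dedup, y_dedup, sample_weight=grouped["size"])
```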
Answered by Erwan on February 19, 2021