Data Science Asked on March 16, 2021
I have a dataset with 130 features (1000 rows). I want to select the best features for my classifier. I started with RFE, but it's taking too long; this is what I did:
from sklearn.feature_selection import RFE

number_of_columns = 130
for i in range(1, number_of_columns):
    # Refit RFE from scratch for every candidate feature count
    rfe = RFE(model, n_features_to_select=i)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)
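(Aside: this per-count loop is what RFECV automates in a single recursive pass, cross-validating every candidate feature count along the way; a minimal sketch, assuming the same model and data:)

from sklearn.feature_selection import RFECV

# One recursive elimination pass instead of 129 separate RFE fits
rfecv = RFECV(model, step=1, cv=5)
rfecv.fit(x_train, y_train)
print(rfecv.n_features_)  # feature count with the best cross-validated score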
Because this took too long, I changed my approach, and I want to hear what you think about it: is it a good/correct approach?
First I ran PCA and found that each component explains around 0.4-1% of the variance, except the last 9 components, which explain less than 0.00001%, so I removed them. Now I have 121 features.
from sklearn.decomposition import PCA

pca = PCA()
fit = pca.fit(x)
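For reference, the per-component variance shares described above could be read off like this (a sketch; the cutoff mirrors the 0.00001% figure in the question):

import numpy as np

ratios = pca.explained_variance_ratio_  # one variance share per component
print(ratios)
weak = np.flatnonzero(ratios < 1e-7)    # components below 0.00001% of the variance
print(len(weak))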
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel and tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chose the number of columns determined by the classifier that gave me the best accuracy:
from sklearn.feature_selection import SelectFromModel

clf.fit(x_train, y_train)  # prefit=True expects an already-fitted estimator
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
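The question doesn't name the four classifiers, so a sketch of that comparison with two hypothetical stand-ins might look like:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Hypothetical stand-ins for the four classifiers tried in the question
for clf in [RandomForestClassifier(), LogisticRegression(max_iter=1000)]:
    clf.fit(x_train, y_train)
    selector = SelectFromModel(clf, prefit=True)
    n_kept = selector.transform(x_train).shape[1]
    print(type(clf).__name__, clf.score(x_test, y_test), n_kept)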
And finally I used RFE with the number of columns I got from SelectFromModel:
# Pass the classifier itself, not the SelectFromModel object
rfe = RFE(clf, n_features_to_select=number_of_columns)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or did I do something wrong?
Also, if I got the best accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?
You may want to try the Lasso (L1 penalty), which does automatic feature selection by "shrinking" parameters, some of them to exactly zero. This is one of the standard approaches to data with many columns and "not so many" rows.
sklearn.linear_model.LogisticRegression(penalty='l1', ...)
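A minimal sketch of how that might look on the asker's split (x_train/y_train are assumed from the question; liblinear is one solver that supports the L1 penalty):

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)

# Features whose coefficients were shrunk to exactly zero are effectively dropped
selected = np.flatnonzero(clf.coef_[0])
print(len(selected), "features kept")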
See also this post.
Edit:
The book "Introduction to Statistical Learning" gives a really good overview. Here are the Python code examples from the book; Section 6.6.2 covers the Lasso.
Answered by Peter on March 16, 2021
For that number of features I use SelectKBest (sklearn.feature_selection.SelectKBest). To do this, I take 1/4, 1/3, 1/2, 2/3, and 3/4 of all the features and analyze how the score used to measure the error varies.
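A sketch of that sweep, assuming the train/test split from the question and some classifier clf (f_classif is one common score function for classification):

from sklearn.feature_selection import SelectKBest, f_classif

n_features = x_train.shape[1]
for frac in (1/4, 1/3, 1/2, 2/3, 3/4):
    k = max(1, int(n_features * frac))
    selector = SelectKBest(score_func=f_classif, k=k)
    x_train_k = selector.fit_transform(x_train, y_train)
    x_test_k = selector.transform(x_test)
    clf.fit(x_train_k, y_train)
    print(k, clf.score(x_test_k, y_test))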
OTHER OPTION:
I use LassoCV (sklearn.linear_model.LassoCV) as follows:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

SEED = 42  # any fixed seed; the original value isn't shown

kfold_on_rf = StratifiedKFold(
    n_splits=10,
    shuffle=True,  # random_state only has an effect when shuffle=True
    random_state=SEED
)
lasso_cv = LassoCV(cv=kfold_on_rf, random_state=SEED, verbose=0)
sfm = SelectFromModel(lasso_cv)
x_selected = sfm.fit_transform(x_train, y_train)
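With an L1 estimator like LassoCV inside, SelectFromModel keeps the features whose fitted coefficients are essentially nonzero (its default threshold for L1-penalized models is 1e-5), so the fit_transform call above fits the Lasso and returns the reduced feature matrix in one step.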
Answered by Victor Villacorta on March 16, 2021