Data Science Asked on March 16, 2021
I have a dataset with 130 features (1000 rows). I want to select the best features for my classifier. I started with RFE, but it's taking too long; this is what I did:
from sklearn.feature_selection import RFE

number_of_columns = 130
for i in range(1, number_of_columns):
    # Refit RFE from scratch for every candidate feature count
    rfe = RFE(model, n_features_to_select=i)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)
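(Aside: this per-count loop is what RFECV automates in a single recursive pass, cross-validating every candidate feature count along the way; a minimal sketch, assuming the same model and data:)

from sklearn.feature_selection import RFECV

# One recursive elimination pass instead of 129 separate RFE fits
rfecv = RFECV(model, step=1, cv=5)
rfecv.fit(x_train, y_train)
print(rfecv.n_features_)  # feature count with the best cross-validated score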
Because this took too long, I changed my approach, and I want to hear what you think about it: is it a good/correct approach?
First I ran PCA and found that each component explains around 0.4-1% of the variance, except the last 9 components, which explain less than 0.00001%, so I removed them. Now I have 121 features.
from sklearn.decomposition import PCA

pca = PCA()
fit = pca.fit(x)
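For reference, the per-component variance shares described above could be read off like this (a sketch; the cutoff mirrors the 0.00001% figure in the question):

import numpy as np

ratios = pca.explained_variance_ratio_  # one variance share per component
print(ratios)
weak = np.flatnonzero(ratios < 1e-7)    # components below 0.00001% of the variance
print(len(weak))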
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel and tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chose the number of columns determined by the classifier that gave me the best accuracy:
from sklearn.feature_selection import SelectFromModel

clf.fit(x_train, y_train)  # prefit=True expects an already-fitted estimator
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
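The question doesn't name the four classifiers, so a sketch of that comparison with two hypothetical stand-ins might look like:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Hypothetical stand-ins for the four classifiers tried in the question
for clf in [RandomForestClassifier(), LogisticRegression(max_iter=1000)]:
    clf.fit(x_train, y_train)
    selector = SelectFromModel(clf, prefit=True)
    n_kept = selector.transform(x_train).shape[1]
    print(type(clf).__name__, clf.score(x_test, y_test), n_kept)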
And finally I used RFE with the number of columns I got from SelectFromModel:
# Pass the classifier itself, not the SelectFromModel object
rfe = RFE(clf, n_features_to_select=number_of_columns)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or did I do something wrong?
Also, if I got the best accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?
You may want to try the Lasso (L1 penalty), which does automatic feature selection by "shrinking" parameters, some of them to exactly zero. This is one of the standard approaches to data with many columns and "not so many" rows.
sklearn.linear_model.LogisticRegression(penalty='l1', ...)
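A minimal sketch of how that might look on the asker's split (x_train/y_train are assumed from the question; liblinear is one solver that supports the L1 penalty):

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)

# Features whose coefficients were shrunk to exactly zero are effectively dropped
selected = np.flatnonzero(clf.coef_[0])
print(len(selected), "features kept")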
See also this post.
Edit:
The book "Introduction to Statistical Learning" gives a really good overview. Here are the Python code examples from the book; Section 6.6.2 covers the Lasso.
Answered by Peter on March 16, 2021
For that number of features I use SelectKBest (sklearn.feature_selection.SelectKBest). To do this, I take 1/4, 1/3, 1/2, 2/3, and 3/4 of all the features and analyze how the score used to measure the error varies.
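A sketch of that sweep, assuming the train/test split from the question and some classifier clf (f_classif is one common score function for classification):

from sklearn.feature_selection import SelectKBest, f_classif

n_features = x_train.shape[1]
for frac in (1/4, 1/3, 1/2, 2/3, 3/4):
    k = max(1, int(n_features * frac))
    selector = SelectKBest(score_func=f_classif, k=k)
    x_train_k = selector.fit_transform(x_train, y_train)
    x_test_k = selector.transform(x_test)
    clf.fit(x_train_k, y_train)
    print(k, clf.score(x_test_k, y_test))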
OTHER OPTION:
I use LassoCV (sklearn.linear_model.LassoCV) as follows:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

SEED = 42  # any fixed seed; the original value isn't shown

kfold_on_rf = StratifiedKFold(
    n_splits=10,
    shuffle=True,  # random_state only has an effect when shuffle=True
    random_state=SEED
)
lasso_cv = LassoCV(cv=kfold_on_rf, random_state=SEED, verbose=0)
sfm = SelectFromModel(lasso_cv)
x_selected = sfm.fit_transform(x_train, y_train)
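With an L1 estimator like LassoCV inside, SelectFromModel keeps the features whose fitted coefficients are essentially nonzero (its default threshold for L1-penalized models is 1e-5), so the fit_transform call above fits the Lasso and returns the reduced feature matrix in one step.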
Answered by Victor Villacorta on March 16, 2021