Scikit-learn SelectKBest is picking up obviously unwanted Features

Question

Dataset
Dataset Summary: Bank Loan (classification) problem
Problem Summary:

I am exploring ways to simplify the EDA (Exploratory Data Analysis) process of finding the best-fit variables.
I came across SelectKBest from the scikit-learn package.
The implementation went fine, except that some of the variables it returned are obviously not going to be good predictors (like primary keys in the dataset).
Is there a problem in my implementation, or is the package supposed to behave this way?

import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import LabelEncoder

# My internal code to read the data file
from src.api.data import LoadDataStore

# Preparing data
raw = LoadDataStore.get_raw()
x_raw = raw.drop(["default_ind", "issue_d"], axis=1)
y_raw = raw[["default_ind"]].values.ravel()

# NA and Encoding
for num_var in x_raw.select_dtypes(include=[numpy.float64]).columns.values:
    x_raw[num_var] = x_raw[num_var].fillna(-1)
encoder = LabelEncoder()
# numpy.object was removed in NumPy 1.24; the "object" dtype string selects the same columns
for cat_var in x_raw.select_dtypes(include=["object"]).columns.values:
    x_raw[cat_var] = x_raw[cat_var].fillna("NA")
    x_raw[cat_var] = encoder.fit_transform(x_raw[cat_var])

# Main Part of this problem
test = SelectKBest(score_func=f_classif, k=15)
fit = test.fit(x_raw, y_raw)
ok_var = []
not_var = []
for flag, var in zip(fit.get_support(), x_raw.columns.values):
    if flag:
        ok_var.append(var)
    else:
        not_var.append(var)

ok_var
['id', 'member_id', 'int_rate', 'grade', 'sub_grade', 'desc', 'title', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'next_pymnt_d']

not_var
['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'installment', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m']

It's clear that id and member_id should NOT belong in the best-features list! Any idea what I am doing wrong?

Edit: I did more digging, and the reply by @lcrmorin is right. (It's a Kaggle dataset, so I can't know why.) Here is the box plot for id:

lcrmorin · Accepted Answer

There seem to be two possible approaches to your problem:

If they are just identification features that you know aren't informative, you should remove them yourself. SelectKBest, like almost any other EDA tool, works on all the features you provide; it has no way of knowing which features are supposedly uninformative identifiers and which are not.

It is possible that the identification feature is somehow informative. I can think of at least two reasons: correlation with time, if the instances are entered in order and what you want to observe changes over time; or, if your identification feature is not unique (instances observed at multiple different times), correlation between your observations. Depending on how your identification feature is built and what you want to achieve, you might or might not want to keep this information.
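
To illustrate both points, here is a minimal sketch on synthetic data (not the Kaggle loan file). It assumes rows are entered in order and that the default rate drifts upward over time, so a sequential id ends up correlated with the target. The column names noise_a and noise_b, and the drift from 5% to 60%, are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 2000

# Hypothetical stand-in for the loan data: ids are assigned in entry order,
# and the probability of default rises steadily from the first row to the last.
ids = np.arange(n)
p_default = np.linspace(0.05, 0.60, n)
y = (rng.random(n) < p_default).astype(int)

X = pd.DataFrame({
    "id": ids,
    "noise_a": rng.normal(size=n),  # unrelated to the target
    "noise_b": rng.normal(size=n),  # unrelated to the target
})

# With the id left in, the ANOVA F-test sees that its mean differs by class.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = list(X.columns[selector.get_support()])
scores = pd.Series(selector.scores_, index=X.columns)
print(scores.sort_values(ascending=False))
print("selected with id:", selected)

# First approach from the answer: drop known identifiers by hand beforehand.
id_cols = ["id"]
X_no_id = X.drop(columns=id_cols)
selector_no_id = SelectKBest(score_func=f_classif, k=1).fit(X_no_id, y)
selected_no_id = list(X_no_id.columns[selector_no_id.get_support()])
print("selected without id:", selected_no_id)
```

In the question's code, the equivalent of the first approach would be adding "id" and "member_id" to the list passed to raw.drop. Printing fit.scores_ next to the column names, as above, is a quick way to see how strongly each feature separates the classes before deciding whether an identifier is carrying real (for example, time-related) information.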
