Data Science Asked on February 25, 2021
Dataset Summary: Bank Loan (classification) problem
Problem Summary:
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import LabelEncoder
# My internal code to read the data file
from src.api.data import LoadDataStore
# Preparing data
raw = LoadDataStore.get_raw()
x_raw = raw.drop(["default_ind", "issue_d"], axis=1)
y_raw = raw[["default_ind"]].values.ravel()
# NA and Encoding
for num_var in x_raw.select_dtypes(include=[numpy.float64]).columns.values:
    x_raw[num_var] = x_raw[num_var].fillna(-1)
encoder = LabelEncoder()
# numpy.object was deprecated in NumPy 1.20 and removed in 1.24; the string "object" selects the same columns
for cat_var in x_raw.select_dtypes(include=["object"]).columns.values:
    x_raw[cat_var] = x_raw[cat_var].fillna("NA")
    x_raw[cat_var] = encoder.fit_transform(x_raw[cat_var])
# Main Part of this problem
test = SelectKBest(score_func=f_classif, k=15)
fit = test.fit(x_raw, y_raw)
ok_var = []
not_var = []
for flag, var in zip(fit.get_support(), x_raw.columns.values):
    if flag:
        ok_var.append(var)
    else:
        not_var.append(var)
ok_var
['id', 'member_id', 'int_rate', 'grade', 'sub_grade', 'desc', 'title', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'next_pymnt_d']
not_var
['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'installment', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m']
It's clear that id and member_id should NOT belong to the best features list! Any idea what I am doing wrong?
Edit: I did more digging and the reply by @lcrmorin is right. (It's a Kaggle dataset, so I can't tell why.) Here is the box plot for id:
[Figure: box plot of id]
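A box plot of id split by default_ind can also be checked numerically. The following is a minimal sketch on synthetic data, not the Kaggle file: the row count, the rising default rate, and the noise column are all assumptions made up for illustration. It prints the per-class quartiles of id and the ANOVA F-score that f_classif assigns to each feature.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

# Synthetic stand-in for the loan data (assumption): ids are assigned in
# entry order, and the default rate rises for later entries.
rng = np.random.default_rng(0)
n = 2000
ids = np.arange(n)
p_default = np.linspace(0.05, 0.60, n)
default_ind = (rng.random(n) < p_default).astype(int)
noise = rng.normal(size=n)  # feature unrelated to the target

df = pd.DataFrame({"id": ids, "noise": noise, "default_ind": default_ind})

# Per-class quartiles of id: the numbers a box plot would draw.
id_summary = df.groupby("default_ind")["id"].describe()
print(id_summary[["25%", "50%", "75%"]])

# ANOVA F-score per feature, the statistic SelectKBest ranks by
# when score_func=f_classif.
f_scores, p_values = f_classif(df[["id", "noise"]], df["default_ind"])
print(dict(zip(["id", "noise"], f_scores)))
```

If the two classes sit at clearly different id ranges, as they do in this constructed example, f_classif will score id highly even though it is only an identifier.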
There seem to be two possible approaches to your problem:
If they are just identification features that you know aren't informative, you should remove them yourself. SelectKBest, like almost any other EDA tool, works on all the features you provide; it has no way of knowing which features are supposedly uninformative identifiers and which are not.
It is possible that the identification feature is somehow informative. I can think of at least two reasons: correlation with time, if instances are entered in order and what you want to observe changes with time; or, if your identification feature is not unique (the same instance observed at multiple different times), correlation between your observations. Depending on how your identification feature is built and what you want to achieve, you might want to keep this information or not.
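The first approach can be sketched as follows. This is a hedged example on a small synthetic frame rather than the asker's data: the column values are invented, int_rate is made informative by construction, and id_columns is a hypothetical list you would fill in from your own knowledge of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the loan features (assumption, not the Kaggle data).
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
x = pd.DataFrame({
    "id": np.arange(n),
    "member_id": np.arange(n) + 10_000,
    "int_rate": rng.normal(12, 3, size=n) + 4 * y,  # informative by construction
    "loan_amnt": rng.normal(15000, 5000, size=n),   # unrelated to y
})

# Identifier columns are dropped by hand: SelectKBest cannot know
# that they are meant to be uninformative.
id_columns = ["id", "member_id"]
x_features = x.drop(columns=id_columns)

selector = SelectKBest(score_func=f_classif, k=1).fit(x_features, y)

# get_support() returns a boolean mask aligned with the input columns.
selected = list(x_features.columns[selector.get_support()])
print(selected)
```

Indexing the columns with the get_support() mask also replaces the manual loop that builds ok_var and not_var.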
Correct answer by lcrmorin on February 25, 2021