Data Science Asked by lona on July 22, 2020
Suppose I have a dataset labeled with two classes, such as healthy and unhealthy, and I applied feature selection (feature importance) on the dataset.
How can I know which class (healthy or unhealthy) each feature is important to?
Assuming we are talking about feature importance for decision-tree-based algorithms here: you cannot really say. It only tells you how useful a feature was for splitting the two classes apart, not which class it points toward.
If you want more insight into how your model makes decisions, you could look into SHAP and LIME. Both are methods that approximate your model and then try to explain it. Both are available as Python libraries (shap and lime).
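As a minimal, hedged sketch (not from the original answer) of how the shap package could be pointed at a fitted tree-based classifier to see which class each feature pushes predictions toward; here model and X are placeholder names for your fitted classifier and feature DataFrame, and the exact return type of shap_values depends on your shap version:

import shap

# explain a fitted tree-based classifier (e.g. a RandomForestClassifier)
explainer = shap.TreeExplainer(model)

# for binary classification this is typically a list with one array per class
# (a single array in newer shap versions); positive values push the
# prediction toward the corresponding class
shap_values = explainer.shap_values(X)

# per-feature view of the magnitude and direction of the contributions
shap.summary_plot(shap_values, X)

LIME works in a similar spirit through lime.lime_tabular.LimeTabularExplainer, explaining one prediction at a time.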
Answered by Simon Larsson on July 22, 2020
Something like this should get you going.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# load the credit dataset
df = pd.read_csv("https://rodeo-tutorials.s3.amazonaws.com/data/credit-data-trainingset.csv")
df.head()

# predictor columns used to train the model
features = np.array(['revolving_utilization_of_unsecured_lines',
                     'age', 'number_of_time30-59_days_past_due_not_worse',
                     'debt_ratio', 'monthly_income',
                     'number_of_open_credit_lines_and_loans',
                     'number_of_times90_days_late',
                     'number_real_estate_loans_or_lines',
                     'number_of_time60-89_days_past_due_not_worse',
                     'number_of_dependents'])

# fit a random forest on the binary target (serious_dlqin2yrs)
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])

# from the calculated importances, order them from least to most important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
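Note that feature_importances_ still will not tell you which class a feature favours. As a rough, informal follow-up (not part of the original answer), you could compare class-conditional means of the features, reusing the df and features variables from the snippet above:

# quick directional check: average feature values within each target class;
# a feature whose mean differs strongly between the two groups is, loosely,
# associated with the class where it takes its more extreme values
class_means = df.groupby('serious_dlqin2yrs')[list(features)].mean()
print(class_means.T)  # rows = features, columns = the two target classes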
Answered by ASH on July 22, 2020