Knowing Feature Importance from Sparse Matrix

Question

I am working with a dataset that has one text column and several numerical columns. I applied TF-IDF to the text column, which produced a sparse matrix. I also converted the numerical features to a sparse matrix with scipy.sparse.csr_matrix and combined the two.

I then feed the combined matrix to a gradient boosting model for training and prediction. Is there a way to plot the feature importance for this sparse matrix so that I can see the names of the important feature columns?

srjit · Answer

You can recover the text feature names from the vocabulary of the fitted TF-IDF vectorizer, which maps each term to its column index.

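# vocabulary_ maps each term to its column index in the TF-IDF matrix
rev_dictionary = {v: k for k, v in vectorizer.vocabulary_.items()}
# order the terms by column index so the list lines up with the matrix columns
column_names_from_text_features = [rev_dictionary[i] for i in range(len(rev_dictionary))]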

Since you know the column names of your other features, the full list of features you pass to XGBoost (after the scipy.sparse.hstack) could be

all_columns = column_names_from_text_features + other_columns  # other_columns: list of your numerical column names

(or the other way around, depending on the order in which you stacked the matrices horizontally)
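To make the ordering point concrete, here is a minimal sketch on toy data. The texts, the numeric values and the names price and weight are made up for illustration; the only thing that matters is that the name list is concatenated in the same order as the matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data; the texts and the numeric column names are made up for illustration.
texts = ["red apple", "green apple", "red car"]
numeric = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
numeric_names = ["price", "weight"]

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)

# Sort the vocabulary by column index so the names line up with the matrix columns.
text_names = [
    term
    for term, idx in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])
]

# Text columns first, numeric columns second; the name list follows the same order.
X = hstack([X_text, csr_matrix(numeric)]).tocsr()
all_columns = text_names + numeric_names

# Sanity check: one name per column.
assert X.shape[1] == len(all_columns)
```

If you stack the numeric block first instead, build the list as numeric_names + text_names.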

Once you have trained the XGBoost model, you can use the plot_importance function to plot the feature importance. Your code would look something like this:

import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance

# xgb_classifier stands for your fitted XGBClassifier
fig, ax = plt.subplots(figsize=(15, 8))
plot_importance(xgb_classifier, max_num_features=15, xlabel='F-score', ylabel='Features', ax=ax)
plt.show()

These features would be labeled fxxx, fyyy, etc., where xxx and yyy are the column indices of the features in the matrix passed to XGBoost.

Using the all_columns list constructed in the first part, you can map the indices in the plot labels back to the feature names.
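A rough sketch of that mapping is below. It assumes the importances come as a dict keyed by the default f<index> labels, which is the shape I would expect from xgb_classifier.get_booster().get_score() when no feature names were supplied. The helper rename_importances and the scores are hypothetical, only there to show the lookup.

```python
def rename_importances(scores, all_columns):
    """Map default 'f<index>' keys to readable column names."""
    named = {}
    for key, value in scores.items():
        index = int(key[1:])  # "f12" -> 12
        named[all_columns[index]] = value
    return named


# Hypothetical scores, shaped like the dict described above.
scores = {"f0": 4.0, "f3": 9.0, "f4": 2.0}
all_columns = ["apple", "car", "green", "red", "price", "weight"]

named = rename_importances(scores, all_columns)

# Most important features first.
top = sorted(named.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('red', 9.0), ('apple', 4.0), ('price', 2.0)]
```

Features that the model never split on may be missing from the dict, so look names up by index rather than by position in the dict.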
