Data Science Asked on December 7, 2021
I was working with a dataset that had a textual column as well as numerical columns, so I used TF-IDF
on the textual column to create a sparse matrix; similarly, I converted the numerical features into a sparse matrix using scipy.sparse.csr_matrix
and combined it with the sparse text features.
I'm then feeding the combined matrix to a gradient boosting model and doing the rest of the training and prediction.
However, I want to know: is there any way I can plot the feature importances of this sparse matrix and recover the names of the important feature columns?
You can recover the names of your text features from the TF-IDF vectorizer's vocabulary:
# vectorizer.vocabulary_ maps each term to its column index in the TF-IDF matrix
rev_dictionary = {v: k for k, v in vectorizer.vocabulary_.items()}
# Walk the indices in order so the names line up with the matrix columns
column_names_from_text_features = [rev_dictionary[i] for i in range(len(rev_dictionary))]
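(In scikit-learn 1.0 and later, vectorizer.get_feature_names_out() returns the same list directly, already in column order.)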
Since you know the column names of your other features, the entire list of features you pass to XGBoost (after the scipy.sparse.hstack
) could be
all_columns = column_names_from_text_features + other_column_names
where other_column_names is the list of your numeric columns' names, in the order you horizontally stacked them (swap the two lists if you stacked the numeric features first).
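For concreteness, here is a minimal sketch of the whole stacking step, assuming a pandas DataFrame df with a text column named "text"; the numeric column names here are hypothetical:

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(df["text"])       # sparse, shape (n_rows, n_terms)

numeric_cols = ["price", "quantity"]                       # hypothetical column names
numeric_features = sp.csr_matrix(df[numeric_cols].values)  # sparse, shape (n_rows, 2)

# Column order of X: TF-IDF terms first, then the numeric columns
X = sp.hstack([text_features, numeric_features]).tocsr()
all_columns = column_names_from_text_features + numeric_cols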
Now, once you have trained the XGBoost model, you can use the plot_importance function to plot the feature importances. Your code would look something like this:
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance

fig, ax = plt.subplots(figsize=(15, 8))
# model is your fitted XGBClassifier
plot_importance(model, max_num_features=15, xlabel='F-score', ylabel='Features', ax=ax)
plt.show()
These features will be labeled fxxx, fyyy, etc., where xxx and yyy are the column indices of the features passed to XGBoost. Using the all_columns
list constructed in the first part, you can map the fN labels in the plot back to the feature names.
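A minimal sketch of that mapping, assuming model is the fitted XGBClassifier and all_columns is ordered the same way as the stacked matrix:

# get_score returns importances keyed by the fN labels, e.g. {"f0": 12.0, ...}
scores = model.get_booster().get_score(importance_type="weight")
named_scores = {all_columns[int(label[1:])]: score for label, score in scores.items()}

# Print the 15 most important features by name
for name, score in sorted(named_scores.items(), key=lambda kv: kv[1], reverse=True)[:15]:
    print(name, score)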
Answered by srjit on December 7, 2021