How can I use multiple features in basic sentiment analysis in scikit-learn?

Question

I've tried to reduce the problem to its absolute basics. Assume I have CSV data like this:
label,text-column,gender-column,day-column
1,"Sample positive text", female, 1
0,"Sample negative text", female, 3
1,"Another positive comment", male, 2
0,"Angry text sample", male, 7

I have this code, which trains on label using a bag-of-words representation (tf-idf in this case) of text-column. I do a 70/30 train/test split, and everything works well.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

vec = TfidfVectorizer()
clf = MultinomialNB()
training_data = pd.read_csv('trainset.csv', delimiter=',')
text_tfidf = vec.fit_transform(training_data['text-column'])
# gen_tfidf = vec.fit_transform(training_data['gender-column'])
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, training_data['label'], test_size = 0.3)
clf.fit(X_train, y_train)

However, for the life of me I simply cannot figure out how to use more than one feature. E.g. I want to use, say, both text-column and gender-column to train the model and see how that impacts accuracy, but I don't understand how to do that!
Am I missing something conceptually important here? Thank you.

Mykola Zotko · Answer

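You can one-hot encode the gender column and append it to your tf-idf matrix. Note that text_tfidf is a SciPy sparse matrix, which has no join method, so stack the columns with scipy.sparse.hstack instead.
from scipy.sparse import csr_matrix, hstack
gender = pd.get_dummies(training_data['gender-column'])
X = hstack([text_tfidf, csr_matrix(gender.to_numpy(dtype=float))]).tocsr()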
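
To make the idea of appending a one-hot encoded column to the tf-idf features concrete, here is a minimal, self-contained sketch. It assumes the four sample rows from the question stand in for trainset.csv, and it stacks the two feature blocks with scipy.sparse.hstack because TfidfVectorizer returns a SciPy sparse matrix rather than a DataFrame. The names gender_sparse and n_text_features are only illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for trainset.csv, using the sample rows from the question.
training_data = pd.DataFrame({
    'label': [1, 0, 1, 0],
    'text-column': ['Sample positive text', 'Sample negative text',
                    'Another positive comment', 'Angry text sample'],
    'gender-column': ['female', 'female', 'male', 'male'],
})

# Text features: one tf-idf column per vocabulary term (sparse matrix).
vec = TfidfVectorizer()
text_tfidf = vec.fit_transform(training_data['text-column'])
n_text_features = text_tfidf.shape[1]

# Categorical feature: one 0/1 column per gender value, converted to sparse.
gender = pd.get_dummies(training_data['gender-column'])
gender_sparse = csr_matrix(gender.to_numpy(dtype=float))

# Place both feature blocks side by side; rows stay aligned with the labels.
X = hstack([text_tfidf, gender_sparse]).tocsr()

clf = MultinomialNB()
clf.fit(X, training_data['label'])
```

The combined matrix X can then go through train_test_split exactly as text_tfidf did. One caveat with this manual approach: new data has to be transformed with the same fitted vectorizer and the same dummy columns in the same order.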

Brian Spiering · Answer

Scikit-learn has ColumnTransformer for heterogeneous data.
Here is a rough code snippet to get you started:
from sklearn.compose                 import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes             import MultinomialNB
from sklearn.pipeline                import Pipeline
from sklearn.preprocessing           import OneHotEncoder

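# X_train must be a DataFrame containing both raw columns, not the tf-idf matrix.
pipeline = Pipeline([('union', ColumnTransformer([('tfidf',  TfidfVectorizer(), 'text-column'),        # string selector: passes a 1-D column of text
                                                  ('onehot', OneHotEncoder(),   ['gender-column'])])),  # list selector: passes the 2-D input the encoder expects
                     ('clf', MultinomialNB())])
pipeline.fit(X_train, y_train)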
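
As a fuller sketch of the ColumnTransformer approach, the example below runs end to end on a small made-up dataset. The eight rows are invented for illustration and extend the question's sample. The split is done on the raw DataFrame so that the pipeline fits the vectorizer and encoder on the training rows only. Passing handle_unknown='ignore', random_state=0, and stratify are choices made for this sketch, not something the original snippet specifies.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data with the same columns as the question's CSV.
data = pd.DataFrame({
    'label': [1, 0, 1, 0, 1, 0, 1, 0],
    'text-column': ['Sample positive text', 'Sample negative text',
                    'Another positive comment', 'Angry text sample',
                    'Great positive review', 'Terrible negative review',
                    'Happy positive note', 'Angry negative note'],
    'gender-column': ['female', 'female', 'male', 'male',
                      'female', 'male', 'male', 'female'],
})

X = data[['text-column', 'gender-column']]   # raw columns, still a DataFrame
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

features = ColumnTransformer([
    # A string selector hands the vectorizer a 1-D column of documents.
    ('tfidf', TfidfVectorizer(), 'text-column'),
    # A list selector hands the encoder the 2-D input it requires.
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['gender-column']),
])
pipeline = Pipeline([('union', features), ('clf', MultinomialNB())])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)    # mean accuracy on held-out rows

# Predicting on new data only needs a DataFrame with the same column names.
new_rows = pd.DataFrame({'text-column': ['Another angry comment'],
                         'gender-column': ['male']})
predictions = pipeline.predict(new_rows)
```

To see how the gender column affects accuracy, fit a second pipeline whose ColumnTransformer contains only the 'tfidf' entry and compare the two scores on the same split.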
