TransWikia.com

How can I use multiple features in basic sentiment analysis in scikit-learn?

Data Science Asked by 11234 on April 4, 2021

I’ve tried to reduce the problem to its absolute basics. Assume I have data (csv) as such:

label,text-column,gender-column,day-column
1,"Sample positive text", female, 1
0,"Sample negative text", female, 3
1,"Another positive comment", male, 2
0,"Angry text sample", male, 7

I also have this code, which trains on label using a bag-of-words representation (tf-idf in this case) of text-column. I do a 70/30 train/test split and it all works well.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

vec = TfidfVectorizer()
clf = MultinomialNB()
training_data = pd.read_csv('trainset.csv', delimiter=',')
text_tfidf = vec.fit_transform(training_data['text-column'])
# gen_tfidf = vec.fit_transform(training_data['gender-column'])
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, training_data['label'], test_size = 0.3)
clf.fit(X_train, y_train)

However, for the life of me I simply cannot figure out how to use more than one feature. E.g. I want to use, say, both text-column and gender-column to train the model and see how that impacts accuracy, but I don’t understand how to do that!

Am I missing something conceptually important here? Thank you.

2 Answers

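You can one-hot encode the gender column and append it to your tf-idf matrix. The tf-idf output is a SciPy sparse matrix, which has no join method, so stack the columns with scipy.sparse.hstack instead.

from scipy.sparse import csr_matrix, hstack

gender = pd.get_dummies(training_data['gender-column'], dtype=float)
X = hstack([text_tfidf, csr_matrix(gender.to_numpy())], format='csr')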
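
To show how the stacked matrix feeds into the split and classifier from the question, here is a minimal end-to-end sketch. The in-memory DataFrame is an assumption standing in for trainset.csv (the four sample rows repeated so the split has enough samples), and random_state and stratify are added only to make the toy run repeatable.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumption: toy stand-in for trainset.csv, built from the question's sample rows.
training_data = pd.DataFrame({
    "label": [1, 0, 1, 0] * 5,
    "text-column": ["Sample positive text", "Sample negative text",
                    "Another positive comment", "Angry text sample"] * 5,
    "gender-column": ["female", "female", "male", "male"] * 5,
})

# Text feature: sparse tf-idf matrix, one column per vocabulary term.
vec = TfidfVectorizer()
text_tfidf = vec.fit_transform(training_data["text-column"])

# Categorical feature: one indicator column per gender value.
gender = pd.get_dummies(training_data["gender-column"], dtype=float)

# Put both feature blocks side by side; CSR format supports the row
# indexing that train_test_split needs.
X = hstack([text_tfidf, csr_matrix(gender.to_numpy())], format="csr")

X_train, X_test, y_train, y_test = train_test_split(
    X, training_data["label"], test_size=0.3,
    random_state=0, stratify=training_data["label"])

clf = MultinomialNB()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"features: {X.shape[1]}, test accuracy: {accuracy:.2f}")
```

Training once on text_tfidf alone and once on X gives the two accuracy figures the question wants to compare. Note that the vectorizer here is fitted before the split, as in the question, so the test rows influence the vocabulary and idf weights.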

Answered by Mykola Zotko on April 4, 2021

Scikit-learn has ColumnTransformer for heterogeneous data.

Here is a rough code snippet to get you started:

from sklearn.compose                 import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes             import MultinomialNB
from sklearn.pipeline                import Pipeline
from sklearn.preprocessing           import OneHotEncoder

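# X_train must be a DataFrame that still contains 'text-column' and 'gender-column'.
pipeline = Pipeline([('union', ColumnTransformer([('tfidf',  TfidfVectorizer(), 'text-column'),        # string: 1-D input for the vectorizer
                                                  ('onehot', OneHotEncoder(),   ['gender-column'])])), # list: 2-D input for the encoder
                     ('clf', MultinomialNB())])
pipeline.fit(X_train, y_train)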
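
For a version that runs on its own, the sketch below splits the raw DataFrame first and lets the pipeline do all the feature extraction. The in-memory DataFrame is an assumption standing in for trainset.csv, and handle_unknown='ignore', random_state and stratify are additions for robustness and repeatability rather than part of the snippet above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumption: toy stand-in for trainset.csv, built from the question's sample rows.
df = pd.DataFrame({
    "label": [1, 0, 1, 0] * 5,
    "text-column": ["Sample positive text", "Sample negative text",
                    "Another positive comment", "Angry text sample"] * 5,
    "gender-column": ["female", "female", "male", "male"] * 5,
})

# Split the raw columns, not a precomputed matrix, so the vectorizer
# and encoder are fitted on the training rows only.
features = df[["text-column", "gender-column"]]
X_train, X_test, y_train, y_test = train_test_split(
    features, df["label"], test_size=0.3,
    random_state=0, stratify=df["label"])

preprocess = ColumnTransformer([
    # A plain string selects a 1-D column, which TfidfVectorizer expects.
    ("tfidf", TfidfVectorizer(), "text-column"),
    # A list selects a 2-D frame, which OneHotEncoder expects.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender-column"]),
])

pipeline = Pipeline([("features", preprocess), ("clf", MultinomialNB())])
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Dropping the 'onehot' entry from the ColumnTransformer and refitting gives the text-only baseline to compare against.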

Answered by Brian Spiering on April 4, 2021
