Data Science Asked by 11234 on April 4, 2021
I’ve tried to reduce the problem to its absolute basics. Assume I have data (CSV) like this:
label,text-column,gender-column,day-column
1,"Sample positive text", female, 1
0,"Sample negative text", female, 3
1,"Another positive comment", male, 2
0,"Angry text sample", male, 7
I have this code, which trains on label using a bag-of-words representation (tf-idf in this case) of text-column. I do a 70/30 train/test split and everything works well.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
vec = TfidfVectorizer()
clf = MultinomialNB()
training_data = pd.read_csv('trainset.csv', delimiter=',')
text_tfidf = vec.fit_transform(training_data['text-column'])
# gen_tfidf = vec.fit_transform(training_data['gender-column'])
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, training_data['label'], test_size = 0.3)
clf.fit(X_train, y_train)
However, for the life of me I cannot figure out how to use more than one feature. For example, I want to train the model on both text-column and gender-column and see how that impacts accuracy, but I don’t understand how to do that!
Am I missing something conceptually important here? Thank you.
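You can one-hot encode the gender column and append it to your tf-idf matrix. Note that text_tfidf is a SciPy sparse matrix, which has no join method, so stack the columns with scipy.sparse.hstack and then pass X (instead of text_tfidf) to train_test_split.
from scipy.sparse import csr_matrix, hstack
gender = pd.get_dummies(training_data['gender-column'], dtype=float)
X = hstack([text_tfidf, csr_matrix(gender.to_numpy())]).tocsr()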
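As a self-contained illustration of appending one-hot columns to a tf-idf matrix, here is a minimal sketch. It assumes the four sample rows from the question stand in for trainset.csv, and it fits on all rows only to show that the combined matrix is accepted by the classifier; it is not a meaningful evaluation.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for trainset.csv (assumption: the four rows shown in the question).
training_data = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "text-column": ["Sample positive text", "Sample negative text",
                    "Another positive comment", "Angry text sample"],
    "gender-column": ["female", "female", "male", "male"],
})

vec = TfidfVectorizer()
text_tfidf = vec.fit_transform(training_data["text-column"])  # sparse matrix

# One-hot encode gender as floats so it can sit next to the tf-idf weights.
gender = pd.get_dummies(training_data["gender-column"], dtype=float)

# Column-wise concatenation: rows stay aligned, gender columns are appended.
X = hstack([text_tfidf, csr_matrix(gender.to_numpy())]).tocsr()

clf = MultinomialNB()
clf.fit(X, training_data["label"])
print(text_tfidf.shape, gender.shape, X.shape)
```

The combined matrix has one extra column per gender category, and every value stays non-negative, which MultinomialNB requires.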
Answered by Mykola Zotko on April 4, 2021
Scikit-learn has ColumnTransformer for heterogeneous data.
Here is a rough code snippet to get you started:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
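# TfidfVectorizer needs a 1-D column (string selector); OneHotEncoder needs 2-D input (list selector)
pipeline = Pipeline([('union', ColumnTransformer([('tfidf', TfidfVectorizer(), 'text-column'),
('onehot', OneHotEncoder(), ['gender-column'])])),
('clf', MultinomialNB())])
# X_train must be a DataFrame holding both raw columns, not the precomputed tf-idf matrix
pipeline.fit(X_train, y_train)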
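For the pipeline to work, the train/test split has to be done on the raw DataFrame rather than on the tf-idf matrix. The following is a minimal end-to-end sketch under a few assumptions: the four rows from the question stand in for trainset.csv, handle_unknown='ignore' is added so that a gender value seen only in the test split does not raise an error, and stratify and random_state are set only to make the tiny split reproducible.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for trainset.csv (assumption: the four rows shown in the question).
df = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "text-column": ["Sample positive text", "Sample negative text",
                    "Another positive comment", "Angry text sample"],
    "gender-column": ["female", "female", "male", "male"],
})

# Split the raw columns; the pipeline does the vectorizing and encoding itself.
X_raw = df[["text-column", "gender-column"]]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.3, stratify=y, random_state=0)

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text-column"),          # 1-D input
    ("onehot", OneHotEncoder(handle_unknown="ignore"),    # 2-D input
     ["gender-column"]),
])
pipeline = Pipeline([("union", features), ("clf", MultinomialNB())])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)  # fraction of correct test predictions
print(accuracy)
```

With only four rows the score itself says nothing; on the real dataset, fitting one pipeline with and one without the 'onehot' entry on the same split would show how the gender feature affects accuracy.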
Answered by Brian Spiering on April 4, 2021