Naive Bayes and Support Vector Machine (NBSVM) Classification

Question

I am relatively new to data science and have a question about NBSVM. I have a two-class problem with text data (newspaper headlines). I want to use NBSVM to predict whether a headline has the label 0 or 1.

As I understand it, I have to proceed as follows:

1. Convert the headlines to a document-term matrix.
2. Calculate the log-count ratio. As I understand it, these are the class probabilities of the individual documents (i.e. the probability that a document is in class 0 or class 1). Please correct me if I'm wrong here.
3. The log-count ratios then serve as input for the SVM, which uses them to set the boundary between the two classes. When new data arrives, the SVM tells you which class it belongs to.

Is this right? Please note that this is only a theoretical procedure, not an implementation.

Harish Kumar · Answer

You can use sklearn's CountVectorizer and TfidfVectorizer to convert the text data into vectors:

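from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# One-step alternative to the CountVectorizer + TfidfTransformer pair below (not used further here)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
# df is assumed to be a pandas DataFrame with a 'text' column (headlines) and a 'class' column (0/1)
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['class'], random_state=0)

# vector representations of the text: raw term counts first, then TF-IDF weighting
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Building an SVM model
svmmodel = LinearSVC().fit(X_train_tfidf, y_train)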
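
The code above fits a linear SVM on TF-IDF features and does not compute a log-count ratio. For the three steps described in the question, here is a minimal sketch, assuming the formulation from Wang and Manning's "Baselines and Bigrams" paper: the log-count ratio r is computed once per term (not per document) as the log of the smoothed, normalized term counts in class 1 divided by those in class 0, and each document's binary term vector is multiplied element-wise by r before it goes into the SVM. The toy headlines, the helper name log_count_ratio and the smoothing value alpha=1 are my own assumptions for illustration, and the paper's additional interpolation step is left out.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy headlines and labels, invented purely for illustration
headlines = [
    "stocks rally as markets surge",
    "markets surge on strong earnings",
    "investors cheer record profits",
    "stocks plunge amid recession fears",
    "markets crash as fears grow",
    "investors panic over heavy losses",
]
labels = np.array([1, 1, 1, 0, 0, 0])

# Step 1: binary document-term matrix (term present / absent per headline)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(headlines)


def log_count_ratio(X, y, alpha=1.0):
    """Hypothetical helper: one value per term, comparing class 1 with class 0."""
    p = alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()  # smoothed counts, class 1
    q = alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()  # smoothed counts, class 0
    return np.log((p / p.sum()) / (q / q.sum()))


# Step 2: log-count ratio, a vector with one entry per vocabulary term
r = log_count_ratio(X, labels)

# Step 3: scale every document's term indicators by r, then fit a linear SVM
X_nb = X.multiply(r).tocsr()
clf = LinearSVC().fit(X_nb, labels)

# New data: reuse the fitted vocabulary and the same r, then predict
new_headlines = ["markets surge", "stocks plunge"]
X_new = vectorizer.transform(new_headlines).multiply(r).tocsr()
predictions = clf.predict(X_new)
print(predictions)
```

New headlines go through the same fitted vectorizer and the same r before predict is called, so nothing is re-estimated on unseen data.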
