Data Science Asked on January 8, 2021
I’m working on a multi-label classification learning project, for which I’ve taken 16K lines of text and, more or less manually, classified them, achieving around 94% accuracy/recall (across three models).
Good results, I’d say.
I then thought I was ready to use my model to predict the labels for a set of new, similar, but previously unseen texts.
However, it appears that, at least with the sklearn models, I can’t simply run predict against the new dataset, because the feature matrix is of a different size.
I must be missing something, but at this stage I wonder what, considering that I always thought classification would help with exactly such a task.
If I need to already know the "answer", I struggle to understand the benefit of the approach.
Below, in short, the approach taken:
from gensim import corpora
corpus = df_train.Terms.to_list()
# build a dictionary
texts = [
word_tokenizer(document, False)
for document in corpus
]
dictionary = corpora.Dictionary(texts)
from gensim.models.tfidfmodel import TfidfModel
# create the tfidf vector
new_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf_model = TfidfModel(new_corpus, smartirs='Lpc')
corpus_tfidf = tfidf_model[new_corpus]
# convert into a format usable by the sklearn
from gensim.matutils import corpus2csc
X = corpus2csc(corpus_tfidf).transpose()
# Fit and predict (y holds the labels from df_train's second column)
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X.toarray(), y)
y_pred = clf.predict(X.toarray())
# At this stage I have my model trained on the 16K texts and their labels.
# Running almost the same code again, down to X = corpus2csc(corpus_tfidf).transpose(),
# supplying a new dataframe should give me a new matrix that I can predict via clf.predict(X.toarray())
corpus = df.Query.to_list()
# build a dictionary
.....
.....
X = corpus2csc(corpus_tfidf).transpose()
y_pred = clf.predict(X.toarray()) # here I get the error
So everything works fine using df_train (shape (16496, 2)), but when I repeat the above with my new dataset df (shape (831, 1)) I get the error mentioned above.
Of course, the second dimension of the first dataset is the one containing the labels, which are used with the fit method, so the problem is not there.
The error is due to the fact that the much smaller corpus generated just 778 columns, whereas the first set of data, with 16K rows, generated 3226 columns. This is because I vectorised my corpus with TF-IDF to give terms some importance. Perhaps this is the error?
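A toy sketch of what I mean, in plain Python rather than gensim (the helper below is mine and only mimics how the matrix width appears to be inferred from the content):

```python
# Sparse documents as (term_id, tfidf_weight) pairs, like a gensim TfidfModel output.
train_docs = [[(0, 0.5), (4, 0.2)], [(2, 0.7), (5, 0.1)]]  # term ids up to 5
unseen_docs = [[(0, 0.3)], [(2, 0.9)]]                      # only term ids 0 and 2

def inferred_width(docs):
    # Width guessed from the content alone: highest term id seen, plus one.
    return max(term_id for doc in docs for term_id, _ in doc) + 1

print(inferred_width(train_docs))   # 6 columns
print(inferred_width(unseen_docs))  # 3 columns -- narrower, so predict() fails
```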
I understand that there are techniques like PCA that can reduce dimensionality, but I’m not sure about the opposite.
Can anybody kindly explain?
UPDATE
Nicholas helped figure out where the error was, though a new one now appears, always in connection with some missing columns.
See below the code and errors as they stand.
from gensim import corpora
corpus = df_train.Terms.to_list()
# build a dictionary
texts = [
word_tokenizer(document, False)
for document in corpus
]
dictionary = corpora.Dictionary(texts)
from gensim.models.tfidfmodel import TfidfModel
# create the tfidf vector
new_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf_model = TfidfModel(new_corpus, smartirs='Lpc')
corpus_tfidf = tfidf_model[new_corpus]
# convert into a format usable by the sklearn
from gensim.matutils import corpus2csc
X = corpus2csc(corpus_tfidf).transpose()
# Fit and predict (y holds the labels from df_train's second column)
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X.toarray(), y)
y_pred = clf.predict(X.toarray())
# At this stage I have my model with the 16K text label.
corpus = df.Query.to_list()
unseen_tokens = [word_tokenizer(document, False) for document in corpus]
unseen_bow = [dictionary.doc2bow(t) for t in unseen_tokens]
unseen_vectors = tfidf_model[unseen_bow]
X = corpus2csc(unseen_vectors).transpose() # here I get the errors in the first screenshot
y_pred = clf.predict(X.toarray()) # here I get the errors in the second screenshot
UPDATE 2
I’ve also tried a second approach, using TfidfVectorizer from sklearn. I did it just in case I was missing something obvious in the previous implementation (you know … the KISS method).
In that case the output is as expected: I get a prediction. So I’m not sure, but I suspect there is a problem somewhere with the corpus2csc function.
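For reference, the sklearn route boils down to something like this (a minimal sketch with made-up texts): fit_transform learns the vocabulary on the training text, and transform reuses it on the unseen text, so both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["red shoes for sale", "blue running shoes", "red winter coat"]
unseen_texts = ["blue coat"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary here
X_unseen = vectorizer.transform(unseen_texts)    # reuses it, no refitting

# Both matrices share the training vocabulary's width.
print(X_train.shape[1], X_unseen.shape[1])
```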
UPDATE 3
I have uploaded the datasets here and here if you want to try. A gist is also available here.
Cheers
Kudos to @Nicholas for putting me on the right track.
The specific reason why this was not working with the corpora model is what I had guessed over time: corpus2csc was kind of compressing/forgetting some details.
The solution is to specify the length of the dictionary when transposing the values.
Therefore, from X = corpus2csc(unseen_vectors).transpose() the code has to become X = corpus2csc(unseen_vectors, num_terms=len(dictionary)).transpose().
Hope this may help somebody one day.
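In scipy terms, the effect can be sketched as follows (a toy example with hypothetical term ids, not gensim’s own internals): without an explicit shape, a sparse matrix built from coordinates is only as wide as the highest index present, much like corpus2csc without num_terms.

```python
from scipy.sparse import csc_matrix

n_terms = 6  # stands in for len(dictionary)
# Two unseen docs containing only term ids 0 and 2 (terms are rows, docs are columns).
data, term_ids, doc_ids = [0.3, 0.9], [0, 2], [0, 1]

inferred = csc_matrix((data, (term_ids, doc_ids)))                      # shape guessed from ids
explicit = csc_matrix((data, (term_ids, doc_ids)), shape=(n_terms, 2))  # width pinned

print(inferred.T.shape)   # (2, 3) -- too narrow for the trained classifier
print(explicit.T.shape)   # (2, 6) -- matches the training matrix
```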
Correct answer by Andrea Moro on January 8, 2021
You need to use the same preprocessing elements (dictionary, tfidf model, etc.) that you used to create your tfidf matrix during training when you come to apply your model to unseen data.
Do not create a new dictionary, tfidf_model, etc. for the unseen data, or else the unseen texts will be mapped to different feature columns and the matrix shapes will not match.
Straight after the line
corpus = df.Query.to_list()
you want something like
unseen_tokens = [word_tokenizer(document, False) for document in corpus]
unseen_bow = [dictionary.doc2bow(t) for t in unseen_tokens]
unseen_vectors = tfidf_model[unseen_bow]
i.e. not creating a new tfidf model or a new dictionary - using the ones you created and used in training.
Answered by Nicholas James Bailey on January 8, 2021