Data Science Asked by Doctor on July 7, 2021
I’m working on simple machine learning problems and I’m trying to build a classifier that can differentiate between spam and non-spam SMS. I’m confused as to whether I should generate the document-term matrix before splitting into test and train sets, or after splitting.
I tried it both ways and found that the accuracy is slightly higher when I split the data before generating the document-term matrix. But to me, this makes no sense. Shouldn’t the accuracy be the same? Does the order of these operations make any difference?
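For concreteness, the two orderings can be sketched as below. This is only a minimal illustration, assuming a plain word-count document-term matrix built by hand; the toy messages and the helper names `build_vocabulary` and `to_dtm` are hypothetical and not taken from my actual code. It shows that the vocabulary, and so the columns of the matrix, can differ between the two orderings. It does not show which ordering gives the higher accuracy.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_dtm(docs, vocab):
    """Build a count-based document-term matrix; tokens not in vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Hypothetical toy corpus: the first three messages are "train", the last is "test".
messages = [
    "win a free prize now",
    "are we meeting today",
    "free entry win cash",
    "call me when you are home",
]
train_docs, test_docs = messages[:3], messages[3:]

# Ordering A: build the matrix on all messages, then split the rows.
vocab_all = build_vocabulary(messages)
dtm_all = to_dtm(messages, vocab_all)
train_a, test_a = dtm_all[:3], dtm_all[3:]

# Ordering B: split first, build the vocabulary from the training rows only,
# then reuse that vocabulary for the test rows.
vocab_train = build_vocabulary(train_docs)
train_b = to_dtm(train_docs, vocab_train)
test_b = to_dtm(test_docs, vocab_train)

# Words that only ordering A knows about, because they occur only in the test message.
test_only_words = sorted(set(vocab_all) - set(vocab_train))
print(len(vocab_all), len(vocab_train), test_only_words)
```

In this toy case, ordering A gives the matrix extra columns for words that appear only in the test message, while ordering B drops those words from the test row.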
I gave the question a lot of thought, and I agree with you. A slight difference might still appear if any random operation happens during training. What model are you using for training?
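One way to check whether randomness is involved is to fix the seed of every random step and rerun both variants. Below is a rough sketch using only the standard library, assuming the train/test split itself is a shuffled split; `shuffled_split` and the toy messages are hypothetical names made up for illustration, and your library may expose its own seed parameter instead.

```python
import random

def shuffled_split(items, test_fraction, seed=None):
    """Shuffle a copy of items and return (train, test); a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the SMS messages.
messages = [f"message {i}" for i in range(20)]

# Two runs with the same seed produce the same split.
train_1, test_1 = shuffled_split(messages, 0.25, seed=42)
train_2, test_2 = shuffled_split(messages, 0.25, seed=42)
same_split = (train_1 == train_2) and (test_1 == test_2)
print(same_split)
```

If the accuracy gap between the two orderings persists with every seed fixed, randomness in the split or the training is unlikely to be the cause.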
Answered by Ta_Req on July 7, 2021