Split into test and train set before or after generating document-term matrix?

Question

I'm working on simple machine learning problems and I'm trying to build a classifier that can differentiate between spam and non-spam SMS messages. I'm confused about whether I should generate the document-term matrix before splitting the data into test and train sets, or after.

I tried it both ways and found that the accuracy is slightly higher when I split the data before generating the document-term matrix. But to me, this makes no sense. Shouldn't the accuracy be the same? Does the order of these operations make any difference?

Ta_Req · Answer

I gave the question a lot of thought, and I agree with you. However, a slight difference might arise if any randomized operation takes place during training (for example, a random shuffle or random initialization). What model are you using for training?
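One way to check for this is to fix the random seed so that both orderings use exactly the same train/test split, and then compare what each ordering produces. Below is a minimal sketch in plain Python, not your actual pipeline: the toy `messages` corpus and the helper names `split`, `build_vocab`, and `to_dtm` are all made up for illustration. It also prints the vocabulary size under each ordering, since a vocabulary built from the training messages alone can only be a subset of one built from all messages.

```python
import random
from collections import Counter

# Hypothetical toy corpus of (text, label) pairs; 1 = spam, 0 = not spam.
messages = [
    ("win a free prize now", 1),
    ("free entry in a weekly draw", 1),
    ("are we still meeting for lunch", 0),
    ("call me when you get home", 0),
    ("claim your cash prize today", 1),
    ("see you at home tonight", 0),
]

def split(data, n_test, seed):
    """Shuffle with a fixed seed so the split is reproducible."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def build_vocab(texts):
    """Sorted list of the unique whitespace-separated tokens."""
    return sorted({token for text in texts for token in text.split()})

def to_dtm(texts, vocab):
    """Document-term matrix as a list of count rows; unknown terms are dropped."""
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for text in texts:
        row = [0] * len(vocab)
        for term, count in Counter(text.split()).items():
            if term in index:
                row[index[term]] = count
        rows.append(row)
    return rows

train, test = split(messages, n_test=2, seed=0)
train_texts = [text for text, _ in train]
test_texts = [text for text, _ in test]
all_texts = [text for text, _ in messages]

# Ordering A: split first, then build the vocabulary from the training texts only.
vocab_after_split = build_vocab(train_texts)
dtm_train_a = to_dtm(train_texts, vocab_after_split)
dtm_test_a = to_dtm(test_texts, vocab_after_split)

# Ordering B: build the vocabulary from every message, then split.
vocab_before_split = build_vocab(all_texts)
dtm_train_b = to_dtm(train_texts, vocab_before_split)
dtm_test_b = to_dtm(test_texts, vocab_before_split)

print("vocabulary size, split first:", len(vocab_after_split))
print("vocabulary size, matrix first:", len(vocab_before_split))
```

If the accuracies still differ once the seed and the split are held fixed, the difference is not coming from randomness in the split, and it is worth comparing the two matrices column by column.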

Answered by Ta_Req on July 7, 2021
