Data Science Asked on May 18, 2021
I have 100 sentences that I want to cluster based on similarity. I’ve used doc2vec to vectorize the sentences into 20-dimensional vectors and applied k-means to cluster them. I haven’t got the desired results yet.
I’ve read that doc2vec performs well only on large datasets. I want to know whether increasing the length of each data sample would compensate for the low number of samples and help the model train better.
For example, if my sentences are originally “making coffee”, “making tea”, “playing with dogs”, would changing them to “making coffee requires a cup of milk and some coffee powder”, “making tea requires boiling water and some tea leaves” (supplement each document with more information) help in getting better results? Would the model understand the context better?
Increasing the sentence length might help, but not much. You can try using n-grams (bigrams or trigrams) when vectorizing the data. Generally, having more sentences is what helps.
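To illustrate the n-gram suggestion, here is a minimal sketch in plain Python. It is not doc2vec: it swaps in simple unigram and bigram count vectors with cosine similarity, purely to show what adding n-grams does to the features. The helper names `ngram_counts` and `cosine` are made up for this example, and the sentences are the ones from the question.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n_values=(1, 2)):
    """Count word n-grams of the given sizes in a lowercased sentence."""
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        # Slide a window of n words across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Example sentences taken from the question.
sentences = ["making coffee", "making tea", "playing with dogs"]
vectors = [ngram_counts(s) for s in sentences]

# Print the pairwise similarities between all sentences.
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = cosine(vectors[i], vectors[j])
        print(f"{sentences[i]!r} vs {sentences[j]!r}: {sim:.3f}")
```

With sentences this short, the only shared feature between "making coffee" and "making tea" is the word "making", which is part of why so few short samples are hard to cluster. If scikit-learn is available, `TfidfVectorizer(ngram_range=(1, 2))` produces comparable n-gram features that can be passed to `KMeans`.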
Answered by Naveen Meka on May 18, 2021