Data Science Asked by Salman Ahmed on March 12, 2021
I have a dataset with 19k observations. Each has approximately 448 features:
– Text description turned into vectors of size 300
– 16 categorical variables represented numerically
– The remainder are quantitative features
Each observation also has a list pointing to 10 other observations (from the 19k) that it’s most similar to. I want to train an ML model that can understand how the 448 features contribute to this “similarity”. Once the model understands that, it could accurately pick the 10 closest existing observations for any new observation, based on how it understands this “similarity”.
I’ve tried clustering with Python’s scikit-learn (K-Means, MiniBatchKMeans, Affinity Propagation), but these haven’t really worked out. I’m currently trying NearestNeighbors, but I’m not sure I’ve got it right.
Please help! Thank you.
With mixed variables, weighting is extremely important.
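To illustrate why the weighting matters, here is a minimal sketch that standardizes each feature block and rescales it so that each block contributes a chosen share of the squared Euclidean distance. The block sizes (300 + 16 + 132 = 448), the synthetic data, the helper name `weighted_blocks`, and the equal weights are all assumptions for illustration, not values from the question. Treating the categorical codes as plain numbers is also a simplification.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the question's data (shapes are assumptions):
# 300 text-embedding dims, 16 categorical codes, 132 quantitative features.
n = 500
text = rng.normal(size=(n, 300))
cats = rng.integers(0, 5, size=(n, 16)).astype(float)
quant = rng.normal(loc=50.0, scale=20.0, size=(n, 132))


def weighted_blocks(blocks, weights):
    """Standardize each block, then scale it so its total variance equals its weight.

    Hypothetical helper: without this step, the 300 text dimensions would
    dominate the distance simply because there are more of them.
    """
    scaled = []
    for block, w in zip(blocks, weights):
        z = StandardScaler().fit_transform(block)  # unit variance per column
        scaled.append(z * np.sqrt(w / block.shape[1]))
    return np.hstack(scaled)


# Equal weights per block are an arbitrary starting point, not a recommendation.
X = weighted_blocks([text, cats, quant], weights=[1.0, 1.0, 1.0])

# Ask for 11 neighbors because each query row is also its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=11).fit(X)
dist, idx = nn.kneighbors(X[:5])
neighbors = idx[:, 1:]  # drop the query itself, leaving 10 neighbors per row
```

Changing the `weights` list changes which block drives the neighbor lists, which is exactly the choice that hand-tuning cannot settle reliably.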
However, you seem to have labeled training data telling you what should be similar. Hence, you should implement a metric learning approach yourself, where the objective is to learn a similarity metric from those labels.
I don't think there is a library that you can just use, but you are better off understanding how this works and writing the code yourself.
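As a rough idea of what such code could look like, here is a minimal sketch of one simple form of metric learning: learning non-negative per-feature weights for a squared Euclidean distance with a triplet hinge loss. Everything in it is an assumption for illustration: the toy data (20 features, of which only the first 5 define the "true" similarity), the function names `knn_lists`, `learn_weights` and `recall`, and the hyperparameters. It is not necessarily the approach the answer had in mind, and a full linear map or a neural embedding would follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 features, but the "true" similarity uses only the first 5.
n, d, k = 300, 20, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = 1.0


def knn_lists(X, w, k):
    """Indices of the k nearest rows under the w-weighted squared distance."""
    Xw = X * np.sqrt(w)
    sq = (Xw ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Xw @ Xw.T
    np.fill_diagonal(D, np.inf)  # exclude each row from its own list
    return np.argsort(D, axis=1)[:, :k]


# Stands in for the given "10 most similar observations" lists.
labels = knn_lists(X, true_w, k)


def learn_weights(X, labels, epochs=200, lr=0.01, margin=1.0, seed=0):
    """Projected gradient descent on a triplet hinge loss over feature weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.ones(d)
    anchors = np.arange(n)
    for _ in range(epochs):
        # One triplet per anchor: a listed neighbor and a random other row.
        pos = labels[anchors, rng.integers(0, labels.shape[1], size=n)]
        neg = rng.integers(0, n, size=n)
        # Discard "negatives" that are the anchor itself or a listed neighbor.
        keep = ~((labels == neg[:, None]).any(axis=1) | (neg == anchors))
        dp = (X[anchors] - X[pos]) ** 2  # per-feature squared diffs, positive
        dn = (X[anchors] - X[neg]) ** 2  # per-feature squared diffs, negative
        # A triplet violates the margin if the positive is not clearly closer.
        viol = keep & ((dp @ w) + margin > (dn @ w))
        if not viol.any():
            continue
        grad = (dp[viol] - dn[viol]).mean(axis=0)
        w = np.maximum(w - lr * grad, 0.0)  # keep weights non-negative
    return w


def recall(pred, truth):
    """Average fraction of the true neighbors that appear in the predicted list."""
    return np.mean([len(set(p) & set(t)) / len(t) for p, t in zip(pred, truth)])


w = learn_weights(X, labels)
baseline = recall(knn_lists(X, np.ones(d), k), labels)  # unweighted distance
learned = recall(knn_lists(X, w, k), labels)            # learned weights
```

For a new observation, the same learned weights would be applied (multiply its features by `np.sqrt(w)`) before running an ordinary nearest-neighbor search against the existing observations. In practice the recall should be measured on held-out observations rather than on the training rows as done here.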
Answered by Has QUIT--Anony-Mousse on March 12, 2021