ML Approach for Getting List of Observations with Similar Features (Discrete+Continuous)

Question

I have a dataset with 19k observations. Each has approximately 448 features:
 - Text description turned into vectors of size 300 
 - 16 categorical variables represented numerically 
 - The remainder are quantitative features

Each observation also has a list pointing to 10 other observations (from the 19k) that it's most similar to. I want to train an ML model that can understand how the 448 features contribute to this "similarity". Once the model understands that, it could accurately pick the 10 closest existing observations for any new observation, based on how it understands this "similarity".

I've tried clustering with Python's scikit-learn (K-Means, MiniBatchKMeans, Affinity Propagation), but these haven't really worked out. I'm currently trying NearestNeighbors, but I'm not sure I've got it right.

Please help! Thank you.

Has QUIT--Anony-Mousse · Answer

With mixed variables, weighting is extremely important.
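To illustrate what weighting means here, the following is a minimal sketch, not a recommended configuration. It assumes synthetic stand-in data with the block sizes from the question (300 text dimensions, 16 categorical variables, 132 quantitative features), one-hot encodes the categorical codes, and rescales each block so that its contribution to the squared Euclidean distance is governed by a hand-picked weight rather than by its column count. The names `embed` and `WEIGHTS` and the weight values are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption: sizes follow the question).
n = 500
text = rng.normal(size=(n, 300))                # text embedding vectors
cats = rng.integers(0, 4, size=(n, 16))         # numerically coded categories
quant = rng.normal(scale=50.0, size=(n, 132))   # quantitative features

# Fit the preprocessing once so new observations are transformed the same way.
text_scaler = StandardScaler().fit(text)
quant_scaler = StandardScaler().fit(quant)
encoder = OneHotEncoder(handle_unknown="ignore").fit(cats)

# Hypothetical block weights; in practice these are what has to be tuned or learned.
WEIGHTS = {"text": 1.0, "cat": 2.0, "quant": 1.0}

def embed(text, cats, quant, weights=WEIGHTS):
    """Stack the three blocks, each divided by its size and scaled by its weight,
    so that no block dominates the distance just because it has more columns."""
    blocks = [
        text_scaler.transform(text) * np.sqrt(weights["text"] / text.shape[1]),
        encoder.transform(cats).toarray() * np.sqrt(weights["cat"] / cats.shape[1]),
        quant_scaler.transform(quant) * np.sqrt(weights["quant"] / quant.shape[1]),
    ]
    return np.hstack(blocks)

index = NearestNeighbors(n_neighbors=10).fit(embed(text, cats, quant))

# Query with three new (synthetic) observations.
new_text = rng.normal(size=(3, 300))
new_cats = rng.integers(0, 4, size=(3, 16))
new_quant = rng.normal(scale=50.0, size=(3, 132))
dist, idx = index.kneighbors(embed(new_text, new_cats, new_quant))
print(idx)  # row i holds the indices of the 10 closest existing observations
```

Changing the values in `WEIGHTS` changes which neighbours are returned, which is why picking them by hand is fragile and why learning them from the known neighbour lists is attractive.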

However, you seem to have labeled training data that tells you which observations should be similar. Hence, you should implement a metric learning approach yourself, where the objective is to learn a similarity metric.

I don't think there is a library that you can just use, but you are better off understanding how this works and writing the code yourself.
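As one possible starting point for writing this yourself, here is a small sketch of metric learning in its simplest form. It assumes a diagonal metric (one non-negative weight per feature) trained with a triplet hinge loss: for each observation, a listed neighbour should end up closer than a randomly drawn non-neighbour. This is only one of many formulations and not necessarily the one intended above; the function name `learn_feature_weights`, the hyperparameters, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def learn_feature_weights(X, neighbors, n_epochs=20, lr=0.01, margin=1.0, seed=0):
    """Learn weights w >= 0 such that sum_j w[j] * (x[j] - y[j])**2 is smaller
    for listed neighbours than for random non-neighbours (triplet hinge loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.ones(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            pos = rng.choice(neighbors[i])      # a known similar observation
            neg = rng.integers(n)               # a candidate dissimilar one
            if neg == i or neg in neighbors[i]:
                continue
            d_pos = (X[i] - X[pos]) ** 2
            d_neg = (X[i] - X[neg]) ** 2
            # The triplet is violated if the neighbour is not closer by `margin`.
            if margin + w @ d_pos - w @ d_neg > 0:
                w -= lr * (d_pos - d_neg)       # subgradient step on the hinge
                np.maximum(w, 0.0, out=w)       # keep the metric valid
    return w

def top_k(X, w, query, k=10):
    """Indices of the k rows of X closest to `query` under the learned weights."""
    dists = ((X - query) ** 2) @ w
    return np.argsort(dists)[:k]

# Synthetic check: only the first 3 of 10 features define "true" similarity.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
true_d = ((X[:, None, :3] - X[None, :, :3]) ** 2).sum(axis=2)
neighbors = np.argsort(true_d, axis=1)[:, 1:6]  # 5 nearest, skipping self

w = learn_feature_weights(X, neighbors)
print(np.round(w, 2))
print(top_k(X, w, X[0], k=6))
```

On the real data, `neighbors` would be the given lists of 10 similar observations, and `X` the scaled feature matrix. A diagonal metric cannot capture interactions between features; a full Mahalanobis matrix or a small neural embedding trained with the same triplet idea are natural next steps if it proves too weak.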
