Data Science Asked by qbit- on December 2, 2020
I have a dataset which consists of multiple user ratings. Each rating looks similar to this:
| Taste | Flavour | Look | Enjoyed | ..... | Tag |
|-------|---------|------|---------|-------|--------|
| 4 | 2 | 2 | 3 | ..... | Banana |
| 5 | 4 | 1 | 2 | ..... | Apple |
| 3 | 1 | 4 | 1 | ..... | Pasta |
| .... | .... | .... | .... | .... | .... |
The columns contain ranks for each row. The task is to cluster the rows,
e.g. I would like to find something similar to:
cluster 1: Banana, Apple
cluster 2: Pasta, Spaghetti
....
We use HDBSCAN with edit distance metric to find clusters, and it works more or less.
The problem, however, is that there are too few features (12 in total) to have “good” clusters.
Therefore I would like to somehow account for the information from “Tag”
in clustering. The idea is to calculate embeddings for each tag and use them as features.
What I’m not certain about is how to include these new features. I would
like the clustering to be primarily determined by the original features. The dimension
of embeddings is much larger than the dimension of the original features, and
the metric on these features is different (e.g. cosine similarity). Therefore, I would like to answer 2 questions:
Pass the data through HDBSCAN twice, first clustering on tag using word embeddings and cosine distance.
I suggest you do it in two steps and not give a weight to the tag feature, because the distance metrics differ. For word embeddings, you will need to use the cosine distance between embeddings, whereas for the other 12 features you are currently using another distance (Euclidean, I assume).
The first step should cluster your data on the semantic characteristics of your tag. This should cluster things like fruits, meats, vegetables, pastas...
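As a rough illustration of this first step, the sketch below builds a pairwise cosine distance matrix from tag embeddings. The embedding values are invented toy vectors (real word embeddings would have far more dimensions), and the helper names `cosine_distance` and `cosine_distance_matrix` are hypothetical, not part of any library.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cosine_distance_matrix(vectors):
    """Build the symmetric pairwise cosine distance matrix."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical 3-dimensional tag embeddings, made up for illustration only.
tag_embeddings = {
    "Banana": [0.9, 0.1, 0.0],
    "Apple": [0.8, 0.2, 0.1],
    "Pasta": [0.1, 0.9, 0.2],
    "Spaghetti": [0.0, 0.8, 0.3],
}
tags = list(tag_embeddings)
tag_distances = cosine_distance_matrix([tag_embeddings[t] for t in tags])
```

A matrix like `tag_distances` could then be handed to an HDBSCAN implementation that accepts precomputed distances (for example, the `hdbscan` package with `metric="precomputed"`, after converting the matrix to a float array). With these toy vectors the two fruits end up much closer to each other than to either pasta tag.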
Then, the second step can sub-cluster the data with your other 12 features. However, given your example
cluster 1: Banana, Apple
cluster 2: Pasta, Spaghetti
I don't see why this second step is necessary. Instead of clustering a second time, you could just use the 12 features to order the data points for the purpose of your exercise, e.g. getting the top "fruits", as clustered in step 1, that people "enjoy" the most.
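The ordering idea can be sketched as follows, assuming each row already carries a cluster label from step 1. The labels and the Spaghetti rating are invented for illustration, and `rank_within_clusters` is a hypothetical helper name. With several ratings per tag, you would probably average them first.

```python
from collections import defaultdict

# Hypothetical rows: (tag, step-1 cluster label, "Enjoyed" rating).
# Cluster labels and the Spaghetti rating are assumptions for this sketch.
rows = [
    ("Banana", 0, 3),
    ("Apple", 0, 2),
    ("Pasta", 1, 1),
    ("Spaghetti", 1, 4),
]

def rank_within_clusters(rows):
    """Group rows by cluster label and sort each group by rating, best first."""
    groups = defaultdict(list)
    for tag, label, enjoyed in rows:
        groups[label].append((tag, enjoyed))
    return {
        label: sorted(members, key=lambda m: m[1], reverse=True)
        for label, members in groups.items()
    }

ranked = rank_within_clusters(rows)
```

Here `ranked[0]` lists the fruit cluster with the most enjoyed tag first, so no second clustering pass is needed to answer "which fruits do people enjoy most".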
Answered by Bruno Lubascher on December 2, 2020