
Heterogeneous clustering with text data

Data Science Asked by qbit- on December 2, 2020

I have a dataset which consists of multiple user ratings. Each rating looks similar to:

| Taste | Flavour | Look | Enjoyed | ..... | Tag    |
|-------|---------|------|---------|-------|--------|
| 4     | 2       | 2    | 3       | ..... | Banana |
| 5     | 4       | 1    | 2       | ..... | Apple  |
| 3     | 1       | 4    | 1       | ..... | Pasta  |
| ....  | ....    | .... | ....    | ....  | ....   |

The columns contain ranks for each row. The task is to cluster the rows,
e.g. I would like to find something similar to:

cluster 1: Banana, Apple
cluster 2: Pasta, Spaghetti
....

We use HDBSCAN with an edit-distance metric to find clusters, and it works more or less.
The problem, however, is that there are too few features (12 in total) to get “good” clusters.
Therefore I would like to somehow account for the information from “Tag”
in the clustering. The idea is to compute an embedding for each tag and use it as additional features.

What I’m not certain about is how to include these new features. I would
like the clustering to be primarily determined by the original features. The dimension
of the embeddings is much larger than that of the original features, and
the metric on them is different (e.g. cosine similarity). Therefore, I would like to answer two questions:

  1. What would be a proper method to combine these heterogeneous features?
  2. How do I properly select the weight for the “Tag” feature? Ideally, I would not like to just postulate it.

One Answer

TL;DR

Pass the data through HDBSCAN twice.

  1. Cluster based on the tags, using word embeddings and cosine distance.
  2. Sub-cluster each cluster from step 1 with your existing method (using the remaining 12 features).

Explanation

I suggest you do it in two steps rather than give a weight to the tag feature, because the distance metrics involved are different. For word embeddings you will need cosine distance, whereas for the other 12 features you are currently using another distance (Euclidean, I assume).
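To see why a single weighted combination is awkward, compare the numeric ranges of the two distances; a small illustrative sketch (toy vectors, embedding dimension 300 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)
ratings_a, ratings_b = rng.integers(1, 6, size=(2, 12)).astype(float)  # toy rating rows
emb_a, emb_b = rng.normal(size=(2, 300))                               # toy embeddings

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

euclid = np.linalg.norm(ratings_a - ratings_b)  # unbounded, grows with dimension
cosine = cosine_distance(emb_a, emb_b)          # always in [0, 2]

print(f"Euclidean on ratings:  {euclid:.2f}")
print(f"Cosine on embeddings:  {cosine:.2f}")
```

A weighted sum of the two would mostly reflect whichever distance happens to have the larger numeric range, so any weight you pick is confounded by scale, not just by importance.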

The first step should cluster your data on the semantic characteristics of your tags. This should group things like fruits, meats, vegetables, pastas...

Then, the second step can sub-cluster the data with your other 12 features. However, given your example

cluster 1: Banana, Apple

cluster 2: Pasta, Spaghetti

I don't see why this second step is necessary. Instead of clustering a second time, you could just use the 12 features to order the data points for the purpose of your exercise, e.g. getting the top "fruits", as clustered in step 1, that people "enjoy" the most.
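The ordering alternative is just a group-by-and-sort; a minimal sketch with pandas, where the `cluster` column is a hypothetical stand-in for the step-1 labels and the ratings are toy values:

```python
import pandas as pd

# Toy ratings with a step-1 cluster label already attached.
df = pd.DataFrame({
    "Tag":     ["Banana", "Apple", "Pasta", "Spaghetti"],
    "Enjoyed": [3, 2, 1, 4],
    "cluster": [0, 0, 1, 1],  # hypothetical labels from pass 1
})

# Top item per cluster, ordered by how much people enjoyed it.
top_per_cluster = (
    df.sort_values("Enjoyed", ascending=False)
      .groupby("cluster")
      .head(1)
)
print(top_per_cluster[["cluster", "Tag"]])
```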

Answered by Bruno Lubascher on December 2, 2020
