
Heterogeneous clustering with text data

Data Science Asked by qbit- on December 2, 2020

I have a dataset which consists of multiple user ratings. Each rating looks similar to:

| Taste | Flavour | Look | Enjoyed | ..... | Tag    |
|-------|---------|------|---------|-------|--------|
| 4     | 2       | 2    | 3       | ..... | Banana |
| 5     | 4       | 1    | 2       | ..... | Apple  |
| 3     | 1       | 4    | 1       | ..... | Pasta  |
| ....  | ....    | .... | ....    | ....  | ....   |

The columns contain ranks for each row. The task is to cluster the rows,
e.g. I would like to find something similar to:

cluster 1: Banana, Apple
cluster 2: Pasta, Spaghetti
....

We use HDBSCAN with an edit-distance metric to find clusters, and it works more or less.
The problem, however, is that there are too few features (12 in total) to get “good” clusters.
Therefore I would like to somehow account for the information from “Tag”
in the clustering. The idea is to compute an embedding for each tag and use it as additional features.

What I’m not certain about is how to include these new features. I would
like the clustering to be primarily determined by the original features. The dimension
of the embeddings is much larger than that of the original features, and
the metric on them is different (e.g. cosine similarity). Therefore, I would like to answer two questions:

  1. What would be a proper method to combine these heterogeneous features?
  2. How do I properly select the weight for the “Tag” feature? Ideally, I would not like to just postulate it.

One Answer

TL;DR

Pass the data through HDBSCAN twice.

  1. Cluster based on the tags, using word embeddings and cosine distance.
  2. Sub-cluster each cluster from step 1 with your existing method (using the remaining 12 features).

Explanation

I suggest you do it in two steps rather than give a weight to the tag feature, because the distance metrics involved are different. For word embeddings you will need cosine distance, whereas for the other 12 features you are currently using another distance (Euclidean, I assume).
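To see why a single weighted combination is awkward, compare the numeric ranges of the two distances; a small illustrative sketch (toy vectors, embedding dimension 300 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)
ratings_a, ratings_b = rng.integers(1, 6, size=(2, 12)).astype(float)  # toy rating rows
emb_a, emb_b = rng.normal(size=(2, 300))                               # toy embeddings

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

euclid = np.linalg.norm(ratings_a - ratings_b)  # unbounded, grows with dimension
cosine = cosine_distance(emb_a, emb_b)          # always in [0, 2]

print(f"Euclidean on ratings:  {euclid:.2f}")
print(f"Cosine on embeddings:  {cosine:.2f}")
```

A weighted sum of the two would mostly reflect whichever distance happens to have the larger numeric range, so any weight you pick is confounded by scale, not just by importance.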

The first step should cluster your data on the semantic characteristics of your tags. This should group things like fruits, meats, vegetables, pastas...

Then, the second step can sub-cluster the data with your other 12 features. However, given your example

cluster 1: Banana, Apple

cluster 2: Pasta, Spaghetti

I don't see why this second step is necessary. Instead of clustering a second time, you could just use the 12 features to order the data points for the purpose of your exercise, e.g. getting the top "fruits", as clustered in step 1, that people "enjoy" the most.
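The ordering alternative is just a group-by-and-sort; a minimal sketch with pandas, where the `cluster` column is a hypothetical stand-in for the step-1 labels and the ratings are toy values:

```python
import pandas as pd

# Toy ratings with a step-1 cluster label already attached.
df = pd.DataFrame({
    "Tag":     ["Banana", "Apple", "Pasta", "Spaghetti"],
    "Enjoyed": [3, 2, 1, 4],
    "cluster": [0, 0, 1, 1],  # hypothetical labels from pass 1
})

# Top item per cluster, ordered by how much people enjoyed it.
top_per_cluster = (
    df.sort_values("Enjoyed", ascending=False)
      .groupby("cluster")
      .head(1)
)
print(top_per_cluster[["cluster", "Tag"]])
```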

Answered by Bruno Lubascher on December 2, 2020
