TransWikia.com

Cluster image labels into given categories using word embeddings

Data Science Asked by taciturno on April 25, 2021

Given:
A set of image Labels, each one a string, and a set of Categories, also strings ($Labels \neq Categories$).

Goal: map the given labels to the given categories, i.e. "squeeze" the label set onto the category set.

Toy example: given two sets, Labels = ['apples', 'juice', 'sun', 'volleyball player', 'birds', 'trees'] and Categories = ['fruits', 'summer'], the result is a simple dictionary dict keyed by the elements of the Categories set:
dict['fruits'] = ['apples', 'juice', 'trees']; dict['summer'] = ['sun', 'volleyball player', 'birds']

Question: is there a way to do that? There are probably several approaches of increasing complexity; it would be good to learn about all of them.

My approaches:

  1. The first idea is simple and intuitive: cluster the Labels with K-means or agglomerative clustering, then take each cluster centroid (for agglomerative clustering, the mean of the cluster's vectors) and assign it the Categories vector closest to it in cosine similarity. Mapped.
  2. The second idea is a bit more complex: use Latent Dirichlet Allocation (LDA). The first part is very similar to the first idea: LDA needs documents, so we can cluster the Labels and treat every cluster as a document. We then get a set of topics and can likewise assign each topic vector the closest vector from Categories. Mapped.
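A minimal sketch of the first idea, using only the Python standard library. The 2-D vectors in `LABEL_VECS` and `CATEGORY_VECS` are made up for illustration (real ones would come from an embedding model), and the helper names `kmeans` and `map_labels_to_categories` are hypothetical. It runs a plain Lloyd's K-means with Euclidean distance, then matches each centroid to the category with the highest cosine similarity.

```python
import math
import random

# Hypothetical 2-D toy embeddings; real vectors would come from an embedding model.
LABEL_VECS = {
    "apples": [0.9, 0.1],
    "juice": [0.8, 0.2],
    "trees": [0.7, 0.3],
    "sun": [0.1, 0.9],
    "volleyball player": [0.2, 0.8],
    "birds": [0.3, 0.7],
}
CATEGORY_VECS = {"fruits": [1.0, 0.0], "summer": [0.0, 1.0]}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's K-means with Euclidean distance; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initialise from k distinct data points
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    return centroids, assignments


def map_labels_to_categories(label_vecs, category_vecs):
    """Cluster the labels, then give each cluster the category nearest its centroid."""
    labels = list(label_vecs)
    centroids, assignments = kmeans(
        [label_vecs[name] for name in labels], k=len(category_vecs)
    )
    result = {name: [] for name in category_vecs}
    for c, centroid in enumerate(centroids):
        best = max(category_vecs, key=lambda name: cosine(centroid, category_vecs[name]))
        result[best].extend(l for l, a in zip(labels, assignments) if a == c)
    return result


mapping = map_labels_to_categories(LABEL_VECS, CATEGORY_VECS)
print(mapping)
```

Nothing here stops two centroids from landing on the same category, which would leave another category empty. If every category has to receive a cluster, the centroid-to-category step would need a one-to-one matching instead of an independent nearest-neighbour lookup.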

The problem with these approaches is word representation. I want the algorithm to be as accurate as possible, but I also understand that words are related to one another through many contexts and can have several meanings. So we need embeddings. Since BERT is currently one of the most powerful and interesting ideas in word representation, I could use BERT (instead of word2vec) and hope it gives the best results.
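One way to keep the choice of embedding open is to pass the embedding function in as a parameter, so that swapping word2vec for BERT only means swapping that function. The sketch below assumes a hypothetical `embed` that averages made-up per-word vectors from `WORD_VECS` (which also handles multi-word labels such as 'volleyball player'). For brevity it uses a simpler baseline than the two approaches above: each label goes straight to its most similar category, with no clustering step.

```python
import math

# Hypothetical per-word vectors standing in for a real embedding model.
WORD_VECS = {
    "apples": [0.9, 0.1], "juice": [0.8, 0.3], "trees": [0.7, 0.4],
    "sun": [0.1, 0.9], "volleyball": [0.3, 0.8], "player": [0.4, 0.6],
    "birds": [0.35, 0.7], "fruits": [1.0, 0.0], "summer": [0.0, 1.0],
}


def embed(text):
    """Average the vectors of the known words in `text` (toy stand-in for a model)."""
    vecs = [WORD_VECS[w] for w in text.lower().split() if w in WORD_VECS]
    if not vecs:
        raise KeyError(f"no known words in {text!r}")
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_to_nearest_category(labels, categories, embed_fn):
    """Map every label to the category whose embedding is most similar to it."""
    category_vecs = {c: embed_fn(c) for c in categories}
    result = {c: [] for c in categories}
    for label in labels:
        vec = embed_fn(label)
        best = max(categories, key=lambda c: cosine(vec, category_vecs[c]))
        result[best].append(label)
    return result


labels = ["apples", "juice", "sun", "volleyball player", "birds", "trees"]
categories = ["fruits", "summer"]
mapping = assign_to_nearest_category(labels, categories, embed)
print(mapping)
```

With a real model, `embed_fn` would wrap whatever call returns a vector for a string; the rest of the code does not depend on where the vectors come from.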

A few more questions (answering is optional, but appreciated):

  1. Which approach seems more reasonable?
  2. Is it a good assumption to build documents (in the second approach) from labels, i.e. from words that are not necessarily related to each other?

Since I’m not a pro in DS yet, any help/comments/hints are appreciated.
