Data Science Asked by user66305 on December 1, 2020
I am trying to simultaneously cluster and visualize text documents using self-organizing maps. Since text documents can be represented in various ways (vector space model, GloVe, etc.), I am trying to figure out how to tell which representation generates the best map. Measures such as quantization error assess the goodness of a map for a given dataset. However, they are not useful for quantitatively telling which representation gives the better output.
Is there a quantitative measure to compare the maps generated using different representations (for example, Tf-idf and GloVe) and tell for which representation the output is better?
From Wikipedia:
A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction.
So you only have the original data itself; no additional data (like labels in a supervised setting). If you also say the result has to have two dimensions, you basically look at functions
$$f: X \rightarrow \mathbb{R}^2$$
where $X \subsetneq \mathbb{R}^n$ in most cases. You already mentioned quantization error.
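To make the mapping $f$ and the quantization error concrete, here is a minimal NumPy sketch. It assumes the SOM is given as a codebook matrix with one weight vector per map unit; the function names (`bmu_indices`, `project`, `quantization_error`) are made up for illustration, and the random codebook is only a stand-in for weights you would get from actually training a SOM.

```python
import numpy as np


def bmu_indices(data, codebook):
    """Flat index of the best-matching unit (closest codebook vector) per sample."""
    # Pairwise differences, shape (n_samples, n_units, n_features)
    diffs = data[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distances, shape (n_samples, n_units)
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    return sq_dists.argmin(axis=1)


def project(data, codebook, grid_shape):
    """The map f: each sample is sent to the (row, col) grid position of its BMU."""
    rows, cols = np.unravel_index(bmu_indices(data, codebook), grid_shape)
    return np.stack([rows, cols], axis=1)


def quantization_error(data, codebook):
    """Mean Euclidean distance between each sample and its BMU's weight vector."""
    nearest = codebook[bmu_indices(data, codebook)]
    return np.linalg.norm(data - nearest, axis=1).mean()


# Toy data: a 3x4 map over 5-dimensional inputs.
# ASSUMPTION: the codebook is random here, not the result of SOM training.
rng = np.random.default_rng(0)
grid_shape = (3, 4)
codebook = rng.normal(size=(grid_shape[0] * grid_shape[1], 5))
data = rng.normal(size=(20, 5))

coords = project(data, codebook, grid_shape)  # shape (20, 2)
qe = quantization_error(data, codebook)       # non-negative float
```

Note that the value of `quantization_error` is a distance in the input space, so it depends on the dimensionality and scale of the representation. That is one way to see why it does not directly compare a map built on tf-idf vectors with one built on GloVe vectors.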
To my knowledge, there is no better measure that does not involve gaining more knowledge about the data itself, through human inspection or by using other datasets.
With human inspection you can, of course, tell for a given dataset and a given human if one mapping seems to make more sense.
You might also consider other dimensionality reduction techniques:
Answered by Martin Thoma on December 1, 2020