Data Science Asked by Thomas Formal on March 30, 2021
Right now, I’m working on coming up with a similarity or dissimilarity matrix for a set of data points for a clustering algorithm. My question is: if I want to use one of the many clustering algorithms available in R, such as the K-Medoids algorithm, does it require a similarity or a dissimilarity matrix as its parameter?
What’s the difference between the two?
If I use the Gower distance from the daisy function in R, does it output a similarity or a dissimilarity matrix?
Also, let's assume that I have $n$ features and they are all categorical (this is just an example). I have a custom distance function where, when comparing two data points $G$ and $H$, I use the formula $$\sum_{i=1}^{n} X_i$$
where $X_i = 1$ if feature $i$ of $G$, $G_i$, and feature $i$ of $H$, $H_i$, are equal to each other, and $X_i = 0$ otherwise. So $X_i = 1$ if and only if $G_i = H_i$, for each of the $n$ categorical features. Will this result in a similarity or a dissimilarity matrix?
Also, as mentioned above, if I want to use one of the many clustering algorithms available in R, such as the K-Medoids algorithm, does it require a similarity or a dissimilarity matrix as its parameter?
In general, does a similarity or a dissimilarity matrix get used for these algorithms?
A similarity is larger if the objects are more similar.
A dissimilarity is larger if the objects are less similar.
This sounds trivial, but if you get the sign wrong, you suddenly search for the worst rather than the best solution...
It's easy to see that a distance is always a dissimilarity.
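To make the distinction concrete with the matching count from the question, here is a minimal sketch. It is written in Python purely for illustration rather than R, and the function names `matching_similarity` and `mismatch_dissimilarity` as well as the sample points are made up for this example. Counting the features that agree grows as the points become more alike, so it behaves as a similarity; counting the features that differ behaves as a dissimilarity.

```python
def matching_similarity(g, h):
    """Count the features on which g and h agree (larger = more similar)."""
    if len(g) != len(h):
        raise ValueError("both points need the same number of features")
    return sum(1 for a, b in zip(g, h) if a == b)


def mismatch_dissimilarity(g, h):
    """Count the features on which g and h differ (larger = less similar)."""
    return len(g) - matching_similarity(g, h)


# Hypothetical data points with n = 3 categorical features.
G = ["red", "small", "round"]
H = ["red", "large", "round"]

print(matching_similarity(G, H))     # 2 of 3 features agree
print(mismatch_dissimilarity(G, H))  # 1 of 3 features differs
```

Identical points get the largest similarity (n) and a dissimilarity of 0, which is the behaviour one would expect of a distance-like quantity.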
K-medoids could be implemented for similarities, but I am not aware of any implementation that does not expect the data to be a dissimilarity. It may be fine to simply pass -similarity to many implementations, because all they care about is minimizing a sum of dissimilarities, which can trivially be shown to be equivalent to maximizing the sum of similarities.
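The equivalence can be checked on a toy case. The sketch below (Python, for illustration only; the data and helper names are hypothetical, and it picks a single medoid rather than running a full K-medoids implementation) chooses the point with the largest total similarity and the point with the smallest total negated similarity, and the two choices coincide.

```python
def similarity(g, h):
    """Matching count: number of features on which g and h agree."""
    return sum(1 for a, b in zip(g, h) if a == b)


def neg_similarity(g, h):
    """Negated similarity, used in place of a dissimilarity."""
    return -similarity(g, h)


# Hypothetical data set with three categorical features per point.
points = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "x", "q"),
    ("a", "y", "r"),
]


def total(score, candidate):
    """Sum of the score between the candidate medoid and every point."""
    return sum(score(candidate, other) for other in points)


# Maximizing total similarity ...
by_max_similarity = max(points, key=lambda c: total(similarity, c))
# ... picks the same medoid as minimizing total negated similarity.
by_min_negated = min(points, key=lambda c: total(neg_similarity, c))

print(by_max_similarity == by_min_negated)
```

If an implementation insists on non-negative input, subtracting each similarity from its largest possible value (here n, the number of features) is another option that preserves the same ordering.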
Answered by Has QUIT--Anony-Mousse on March 30, 2021
In many machine learning packages, a dissimilarity matrix, typically a matrix of pairwise distances, is the input parameter for clustering (and sometimes for semi-supervised models).
However, the real parameter is the type of distance. You need to tune the distance type just as you would tune k in k-means, choosing it according to your business objective.
Check https://en.wikipedia.org/wiki/Distance for distance types. Additionally, in some cases correlation is used as a similarity.
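As a rough sketch of the correlation case (Python, for illustration only): Pearson correlation is larger for more similar profiles, so it acts as a similarity. Turning it into a dissimilarity as 1 - r is one common convention and is an assumption here, not something stated in this answer; the function names are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_dissimilarity(x, y):
    """Assumed conversion: 1 - r, near 0 for perfectly correlated inputs."""
    return 1.0 - pearson(x, y)


# Hypothetical numeric profiles.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 8.0]   # rises with u
w = [4.0, 3.0, 2.0, 1.0]   # falls as u rises

print(correlation_dissimilarity(u, v))  # close to 0
print(correlation_dissimilarity(u, w))  # close to 2
```

Whether anti-correlated profiles should count as very dissimilar (as with 1 - r) or as similar (for example with 1 - |r|) depends on the application.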
Answered by Ilker Kurtulus on March 30, 2021