Data Science Asked by Ruuza on August 26, 2021
I have clustered vectors by cosine distance using the nltk clusterer. If I understand correctly, the Y axis for the elbow method with Euclidean distance would be the sum of the squared distances between each cluster's centroid and the vectors that belong to that cluster.
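Written out as a formula, the quantity I mean for Euclidean distance is below. The symbols are my own notation, not anything from nltk: $C_j$ is the set of vectors assigned to cluster $j$, $\mu_j$ is that cluster's centroid, and $k$ is the number of clusters.

$$
\mathrm{SSE}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$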
My question is: Would it be the same for clusters using cosine distance?
EDIT: OK, so I tried the sum of squares with cosine distance and it seems that it’s returning the same values… Here's my code:
EDIT2: My bad, it is working.
```python
from nltk.cluster import KMeansClusterer, cosine_distance
import numpy as np

# Load the dataset obtained from http://cs.joensuu.fi/sipu/datasets/a1.txt
testing_vectors = np.loadtxt("a1.txt")

for k in range(1, 10):
    kclusterer = KMeansClusterer(k, distance=cosine_distance)
    assigned_clusters = kclusterer.cluster(testing_vectors, assign_clusters=True)

    sum_of_squares = 0
    current_cluster = 0
    for centroid in kclusterer.means():
        current_page = 0
        for index_of_cluster_of_page in assigned_clusters:
            if index_of_cluster_of_page == current_cluster:
                y = testing_vectors[current_page]
                # Euclidean alternative:
                # sum_of_squares += np.sum((centroid - y) ** 2)
                # Squared cosine distance (cosine_distance is 1 - cosine similarity)
                sum_of_squares += cosine_distance(centroid, y) ** 2
            current_page += 1
        current_cluster += 1

    print("for k=%s the sum of squares is:%s" % (k, sum_of_squares))
```
OK. So what I understood is that, for clusters built with the cosine metric, I can use either of two measures for the elbow plot: the sum of squared distances from the centroids to the vectors that belong to their clusters, with the distance calculated either as Euclidean or as cosine. The cosine version would probably be more precise, but it is more complicated because of the dot products. The squared distance is only used as an optimization, so that we don't have to calculate the square root in either the Euclidean or the cosine distance formula.
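To make the two options concrete, here is a minimal sketch that computes both measures for a given assignment. The helper `within_cluster_sum_of_squares` and the tiny data set are made up for illustration and are not part of nltk. It assumes cosine distance means 1 minus cosine similarity and that no vector has zero length.

```python
import numpy as np


def within_cluster_sum_of_squares(vectors, centroids, labels, metric="cosine"):
    """Sum of squared distances between each vector and its assigned centroid.

    Hypothetical helper, not part of nltk. labels[i] is the index of the
    centroid assigned to vectors[i].
    """
    vectors = np.asarray(vectors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    assigned = centroids[np.asarray(labels)]  # centroid of each vector, row by row

    if metric == "euclidean":
        distances_squared = np.sum((vectors - assigned) ** 2, axis=1)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity; assumes no zero-length vectors.
        dots = np.sum(vectors * assigned, axis=1)
        norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(assigned, axis=1)
        distances_squared = (1.0 - dots / norms) ** 2
    else:
        raise ValueError("metric must be 'euclidean' or 'cosine'")
    return float(np.sum(distances_squared))


# Tiny made-up example: two clusters in 2D.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
centroids = np.array([[1.5, 0.0], [0.0, 2.0]])
labels = [0, 0, 1, 1]

euclidean_score = within_cluster_sum_of_squares(vectors, centroids, labels, "euclidean")
cosine_score = within_cluster_sum_of_squares(vectors, centroids, labels, "cosine")
print(euclidean_score, cosine_score)
```

In this example every vector points in the same direction as its centroid, so the cosine measure is zero while the Euclidean one is not. That shows the two curves can differ, since cosine distance ignores vector length. With the code from the question, the same helper could presumably be called with `kclusterer.means()` and `assigned_clusters`.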
Answered by Ruuza on August 26, 2021