How to make k-means distributed?

Data Science · Asked by gsamaras on November 20, 2020

After setting up a two-node Hadoop cluster and working through Hadoop and Python, and based on this naive implementation, I ended up with this code:

import numpy as np

# randomize_centroids, has_converged and euclidean_dist are helper
# functions from the tutorial's naive implementation linked above.

def kmeans(data, k, c=None):
    # seed the centroids: either use the ones passed in, or pick k at random
    if c is not None:
        centroids = c
    else:
        centroids = randomize_centroids(data, [], k)

    old_centroids = [[] for _ in range(k)]

    iterations = 0
    while not has_converged(centroids, old_centroids, iterations):
        iterations += 1

        clusters = [[] for _ in range(k)]

        # assignment step: put every data point into its nearest cluster
        clusters = euclidean_dist(data, centroids, clusters)

        # update step: move each centroid to the mean of its cluster
        for index, cluster in enumerate(clusters):
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()

    print("The total number of data instances is: " + str(len(data)))

I have tested it with serial execution and it works fine. How can I make it distributed over Hadoop? In other words, what should go into the mapper and what into the reducer?

Please note that, if possible, I would like to follow the tutorial’s style, since it is the approach I already understand.
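
To make the question concrete, here is a rough Hadoop Streaming sketch of the split I am imagining, with the assignment step in the mapper and the centroid update in the reducer. The file names, the comma-separated input format, and the centroids.txt side file are all my assumptions, not part of the tutorial:

#!/usr/bin/env python
# mapper.py -- assignment step of one k-means iteration (hypothetical sketch).
# Reads one comma-separated point per line from stdin; centroids.txt is a
# side file assumed to be shipped to every node along with the job.
import sys
import numpy as np

centroids = np.loadtxt("centroids.txt", delimiter=",")

for line in sys.stdin:
    point = np.array([float(x) for x in line.strip().split(",")])
    # key = index of the nearest centroid, value = the point itself
    nearest = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
    print("%d\t%s" % (nearest, line.strip()))

#!/usr/bin/env python
# reducer.py -- update step: average all points assigned to each centroid.
# Hadoop Streaming delivers the mapper output grouped and sorted by key.
import sys
import numpy as np

current_key, points = None, []

def emit(key, pts):
    print("%s\t%s" % (key, ",".join(str(v) for v in np.mean(pts, axis=0))))

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key and current_key is not None:
        emit(current_key, points)
        points = []
    current_key = key
    points.append([float(x) for x in value.split(",")])

if current_key is not None:
    emit(current_key, points)

My understanding is that each MapReduce job would perform a single iteration, and a driver script would re-submit the job with the updated centroids until has_converged is satisfied. Is that the right way to split it?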

One Answer

Unless you are trying to do this as a learning exercise, just use Spark, which has ML libraries built for distributed computing. See here
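
For example, a minimal PySpark sketch using the built-in KMeans from pyspark.ml might look like the following. The input path and the assumption that every CSV column is a numeric feature are placeholders:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("distributed-kmeans").getOrCreate()

# hypothetical input: a headerless CSV of numeric columns on HDFS
df = spark.read.csv("hdfs:///data/points.csv", inferSchema=True)

# pack the raw columns into the single vector column that KMeans expects
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
points = assembler.transform(df)

model = KMeans(k=3, seed=1).fit(points)
print(model.clusterCenters())

Spark distributes both the assignment and the update steps across the cluster for you, so there is no mapper/reducer plumbing to write by hand.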

Correct answer by Bob Baxley on November 20, 2020
