Data Science. Asked by gsamaras on November 20, 2020
After setting up a 2-node Hadoop cluster, learning the basics of Hadoop and Python, and starting from this naive implementation, I ended up with this code:
import numpy as np

# randomize_centroids, has_converged and euclidean_dist are helper
# functions defined elsewhere in the tutorial code.
def kmeans(data, k, c=None):
    if c is not None:
        centroids = c
    else:
        centroids = []
        centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]
    iterations = 0
    while not has_converged(centroids, old_centroids, iterations):
        iterations += 1
        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)

        # recalculate centroids
        index = 0
        for cluster in clusters:
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()
            index += 1

    print("The total number of data instances is: " + str(len(data)))
I have tested it with serial execution and it works fine. How can I make it distributed over Hadoop? In other words, what should go into the mapper and what into the reducer?
If possible, I would like to stick to the tutorial's style, since it is something I already understand.
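To make the question concrete, here is my rough, untested guess at the split (a sketch only, assuming Hadoop Streaming; centroids.txt is a placeholder side file shipped to each node, and the input is one comma-separated point per line): the mapper would assign each point to its nearest current centroid, and the reducer would average the points per centroid to emit the updated centroids.

    #!/usr/bin/env python
    # mapper.py -- assignment step (sketch; centroids.txt is a placeholder
    # side file distributed to each node, e.g. via the -files option)
    import sys
    import numpy as np

    centroids = np.loadtxt("centroids.txt", delimiter=",")

    for line in sys.stdin:
        point = np.array([float(x) for x in line.strip().split(",")])
        # emit: index of the nearest centroid <tab> the point itself
        nearest = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
        print("%d\t%s" % (nearest, line.strip()))

The reducer would then average all points that arrive under the same centroid index:

    #!/usr/bin/env python
    # reducer.py -- update step: average all points assigned to each centroid
    import sys
    import numpy as np

    current_key, points = None, []

    def emit(key, pts):
        print("%s\t%s" % (key, ",".join(map(str, np.mean(pts, axis=0)))))

    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if current_key is not None and key != current_key:
            emit(current_key, points)
            points = []
        current_key = key
        points.append([float(x) for x in value.split(",")])

    if current_key is not None:
        emit(current_key, points)

A driver script would re-run the streaming job, feeding the reducer output back in as the next centroids.txt, until something like has_converged is satisfied. Is this the right way to split it?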
Unless you are trying to do this as a learning exercise, just use Spark, which has ML libraries designed for distributed computing. See here.
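For illustration, a minimal PySpark MLlib sketch of distributed k-means (the file name data.csv and the column names x, y are placeholders, not from the original post):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

    # data.csv is a placeholder file with numeric feature columns x and y
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # MLlib expects a single vector column of features
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
    features = assembler.transform(df)

    # fit k-means with k=3 clusters; Spark handles the distribution
    model = KMeans(k=3, seed=1).fit(features)
    print(model.clusterCenters())

    spark.stop()

Spark keeps the data partitioned across the cluster and runs the assignment and update steps for you, so there is no need to write the mapper and reducer by hand.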
Correct answer by Bob Baxley on November 20, 2020