Data Science. Asked by gsamaras on November 20, 2020
After setting up a 2-node Hadoop cluster, learning the basics of Hadoop and Python, and starting from this naive implementation, I ended up with this code:
import numpy as np

# randomize_centroids, has_converged and euclidean_dist are helper
# functions defined elsewhere in the tutorial code.
def kmeans(data, k, c=None):
    if c is not None:
        centroids = c
    else:
        centroids = []
        centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]
    iterations = 0
    while not has_converged(centroids, old_centroids, iterations):
        iterations += 1
        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)

        # recalculate centroids
        index = 0
        for cluster in clusters:
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()
            index += 1

    print("The total number of data instances is: " + str(len(data)))
I have tested it with serial execution and it works fine. How can I make it distributed over Hadoop? In other words, what should go into the mapper and what into the reducer?
If possible, I would like to stick to the tutorial's style, since it is something I already understand.
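To make the question concrete, here is my rough, untested guess at the split (a sketch only, assuming Hadoop Streaming; centroids.txt is a placeholder side file shipped to each node, and the input is one comma-separated point per line): the mapper would assign each point to its nearest current centroid, and the reducer would average the points per centroid to emit the updated centroids.

    #!/usr/bin/env python
    # mapper.py -- assignment step (sketch; centroids.txt is a placeholder
    # side file distributed to each node, e.g. via the -files option)
    import sys
    import numpy as np

    centroids = np.loadtxt("centroids.txt", delimiter=",")

    for line in sys.stdin:
        point = np.array([float(x) for x in line.strip().split(",")])
        # emit: index of the nearest centroid <tab> the point itself
        nearest = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
        print("%d\t%s" % (nearest, line.strip()))

The reducer would then average all points that arrive under the same centroid index:

    #!/usr/bin/env python
    # reducer.py -- update step: average all points assigned to each centroid
    import sys
    import numpy as np

    current_key, points = None, []

    def emit(key, pts):
        print("%s\t%s" % (key, ",".join(map(str, np.mean(pts, axis=0)))))

    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if current_key is not None and key != current_key:
            emit(current_key, points)
            points = []
        current_key = key
        points.append([float(x) for x in value.split(",")])

    if current_key is not None:
        emit(current_key, points)

A driver script would re-run the streaming job, feeding the reducer output back in as the next centroids.txt, until something like has_converged is satisfied. Is this the right way to split it?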
Unless you are trying to do this as a learning exercise, just use Spark, which has ML libraries designed for distributed computing. See here.
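For illustration, a minimal PySpark MLlib sketch of distributed k-means (the file name data.csv and the column names x, y are placeholders, not from the original post):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

    # data.csv is a placeholder file with numeric feature columns x and y
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # MLlib expects a single vector column of features
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
    features = assembler.transform(df)

    # fit k-means with k=3 clusters; Spark handles the distribution
    model = KMeans(k=3, seed=1).fit(features)
    print(model.clusterCenters())

    spark.stop()

Spark keeps the data partitioned across the cluster and runs the assignment and update steps for you, so there is no need to write the mapper and reducer by hand.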
Correct answer by Bob Baxley on November 20, 2020