Why does the BFR (Bradley, Fayyad and Reina) algorithm assume clusters to be normally distributed around their centroids?

Question

I'm following a course on data mining based on the lectures from Stanford University and the book Mining of massive datasets.

On the topic of clustering, the BFR algorithm is explained with this video.
I understand how the algorithm works, but I am unclear on why it makes the strong assumption that each cluster is normally distributed around a centroid in Euclidean space.

The video explains that the assumption implies that clusters look like axis-aligned ellipses, which is understandable as the dimensions must be independent.
I've watched the video a few times, and read the section in the book (freely downloadable using the first link) on pages 257-259, but I'm unable to grasp why that assumption is made, and why it has to be made.

Could someone explain this for me?

user1315621 · Accepted Answer

Roughly, the algorithm needs to estimate the probability that it is assigning a point to the correct cluster.
It adds a point P to a cluster only if it is very unlikely that, after all the points have been processed, some other cluster centroid will turn out to be nearer to P.
To decide this, the algorithm measures the probability that, if P belongs to a cluster, it would be found as far from that cluster's centroid as it is. To compute that probability, it assumes that the points of each cluster are normally distributed, with the axes of the distribution aligned with the axes of the space.
