How to find farthest data points from a predefined cluster in a data set with Python?

Question

I have a data set where certain rows are labeled as one class (and are therefore interpreted as a distinct cluster #1), but other points are either unlabeled or ambiguous. Hence I want to figure out which unlabeled data points lie farthest from cluster #1 by sorting them by their distance from cluster #1 (more precisely, by the distance from each unlabeled point to its closest point in cluster #1).
My first idea would be to create a similarity matrix between the labeled and unlabeled points and calculate the closest distance per unlabeled point from it, but somehow this seems a bit clumsy. Is there a more elegant/effective way?
(I used to use sklearn for similar tasks, but as far as I know, unsupervised clustering algos don't explicitly provide this kind of specific information.)

etiennedm · Answer

You want to find, for each unlabeled data point, its nearest neighbor in your labeled cluster.
Using sklearn, you can fit a NearestNeighbors() class with a given metric, algorithm (Ball-tree, KD-tree,...) and all other parameters (see here).
Then get the nearest labeled neighbor of each unlabeled data point, and its distance, by using the kneighbors() method.
Here is some sample code:
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fake data
labeled_samples = [[0, 1.2], [0, 1.3], [0, 1.4]]
unlabeled_samples = [[0, 1.7], [0.5, 0.5], [1, 1]]

# Create your class with your labeled cluster
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(labeled_samples)

# Get the distance/index of the nearest labeled neighbor for each unlabeled point
distances, indexes = neigh.kneighbors(unlabeled_samples, 1, return_distance=True)

Then you just have to sort the unlabeled points by these distances, in descending order if you want the farthest ones first.
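The sorting step could look something like the sketch below. It repeats the fake data from above so that it runs on its own; the names nearest_dist and order are just illustrative and not part of sklearn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Same fake data as above
labeled_samples = [[0, 1.2], [0, 1.3], [0, 1.4]]
unlabeled_samples = [[0, 1.7], [0.5, 0.5], [1, 1]]

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(labeled_samples)
distances, indexes = neigh.kneighbors(unlabeled_samples, 1, return_distance=True)

# kneighbors returns arrays of shape (n_queries, n_neighbors),
# so flatten to one distance per unlabeled point
nearest_dist = distances.ravel()

# Positions of the unlabeled points, farthest from the cluster first
order = np.argsort(nearest_dist)[::-1]

for i in order:
    print(unlabeled_samples[i], nearest_dist[i])
```

Here order indexes into unlabeled_samples, so order[0] is the unlabeled point whose closest labeled point is the farthest away.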
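Note: this approach is generally more efficient than computing the distances from every unlabeled point to every labeled data point and then sorting them, since tree-based algorithms (Ball-tree, KD-tree) do not need to compare every pair of points. See this note for more info.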
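For comparison, here is a rough sketch of the brute-force alternative (the full distance matrix idea from the question), checked against the NearestNeighbors result. It assumes the default Euclidean metric in both cases; the names full_matrix, brute_min and tree_dist are only illustrative. On a toy data set like this there is no speed difference to observe; it only shows that both routes give the same distances.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

# Same fake data as above
labeled_samples = [[0, 1.2], [0, 1.3], [0, 1.4]]
unlabeled_samples = [[0, 1.7], [0.5, 0.5], [1, 1]]

# Brute force: full (n_unlabeled, n_labeled) distance matrix,
# then the minimum of each row
full_matrix = pairwise_distances(unlabeled_samples, labeled_samples)
brute_min = full_matrix.min(axis=1)

# Nearest-neighbor query: one distance per unlabeled point
neigh = NearestNeighbors(n_neighbors=1).fit(labeled_samples)
tree_dist, _ = neigh.kneighbors(unlabeled_samples)

# Both approaches should agree up to floating-point error
print(np.allclose(brute_min, tree_dist.ravel()))
```

The brute-force version has to hold the whole n_unlabeled x n_labeled matrix in memory, which is the part that gets clumsy as the data set grows.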
