An efficient algorithm to find Nearest Neighbours

Question

So imagine I have a $m$ vectors each of dimension $d$. Lets call them, $vec x_{i}$, with $i = 1, 2, 3, 4, 5, dots, m$. Now the idea is to find the neighbours of $vec x_{i}$ (calling them $vec x_{j}$), within a ball of radius $r$. So, $$ || vec x_{i} - vec x_{j} || le r$$.
Note: $|| cdots ||$ is the Euclidean norm.
Naively one can compute distance between all the $j$'s for a particular $i$, and then sort them, and keep only the elements which are less than $r$. This is computationally expensive for large $m$ and $d$.

So the question is how can I efficiently compute the neighbours $x_{j}$?

Wolfgang Bangerth · Answer

I suspect that you can get away with a subset of the information by making use of the fact that
$$
  |x_i-x_j| = |x_i-x_k+x_k-x_j| ge |x_i-x_k| - |x_k-x_j|.
$$
You can use this in the following way: Say, you want $r=1$. If $x_1$ and $x_2$ are four units away (not neighbors), and $x_2$ and $x_3$ are one unit away, then $x_1$ and $x_3$ are at least three units away -- also not neighbors. I didn't need to compute the distance between $x_1$ and $x_3$ to determine that.
What this means is that it is possible to use this "reverse triangle inequality" to determine neighborship by knowing only a subset of the entries of the distance matrix
$$
  D_{ij} = |x_i-x_j|.
$$
The question is: What subset do you need to know to make all determinations? I would not be very surprised if the algorithm to determine the minimal subset is NP, but one could probably come up with a cheaper algorithm that, starting from some subset of entries of $D_{ij}$ determines which entries do not need to be computed, then computes some of the other currently unknown ones, and iterates that until for every entry of $D$ you've either computed its value, or determined that you don't need it.

An efficient algorithm to find Nearest Neighbours

One Answer

Add your own answers!

Ask a Question