Data Science Asked by markagrios on February 16, 2021
I’m working on a project where I have many high-dimensional points and I want to find the densest neighborhood among them. Ideally, out of my ~500 points, each a 4×300 matrix (a 300 ms time series of four variables), I want to find the ~30 points that are most similar to each other. I’ve looked into k-nearest-neighbor methods, but those all find the neighbors of a single query point; I want to be able to specify the size N of the neighborhood (how many data points) and get the subset of N points that are closest together.
I think(?) this is a clustering problem, so I’ve looked at soft subspace clustering, which clusters using the most relevant dimensions, and also at biclustering, which was developed specifically for time series data.
Any help would be greatly appreciated! I’m sure someone else has thought of this problem, but I don’t know exactly how to word it, and everything I’ve looked at so far has been close but not exactly what I need. Any papers/code/tutorials/etc. that give me information on how I might solve this would be fantastic 😀
Update: I’ve looked into using the Fréchet distance as a distance between time series. Would that be an appropriate metric for this type of data?
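Here’s a rough sketch of the discrete Fréchet distance I’m considering (the Eiter–Mannila dynamic program). I’d pass each 4×300 sample as its 300×4 transpose so that rows are time steps; that orientation is just my assumption about how to compare the series:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance (Eiter–Mannila) between two time series.

    P, Q: arrays of shape (n_timesteps, n_variables).
    """
    n, m = len(P), len(Q)
    # pairwise Euclidean distances between the time points of P and Q
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d[i, j])
    return ca[-1, -1]

# e.g. for two 4x300 samples A and B:  discrete_frechet(A.T, B.T)
```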
If you're looking for a metric to quantify how close together (or not) the points in a set are, I think there are a few obvious ones to consider:
Diameter, from point-set topology: look at all pairs of points in the set and take the maximum distance $\epsilon$ between any two of them. Call that the "diameter" of the set. In other words, the widest the set is in any direction is $\epsilon$. This is easy to compute and might get you what you want.
Find the smallest number $\epsilon$ such that there's a ball somewhere of radius $\epsilon$ that contains your entire set. It isn't obvious how to compute this in real life, but here's an equivalent way to do it: pick a really small number $\epsilon$ and draw a little ball of radius $\epsilon$ around each point in the set. When $\epsilon$ is sufficiently small, these little balls don't overlap at all; as you increase $\epsilon$, though, they start to overlap a little, and eventually all the balls will overlap at some point. Freeze it there, grab that $\epsilon$, and call it the size of your set. This is a little harder to compute, but it's the version of set distance used in the original definition of persistent homology, so it's certainly in use in big-data analysis already; see here for a nice visualization.
A variation on #2, much easier to calculate and now in use as one of the standard simplifications of persistent homology: pick your $\epsilon$ really small so the balls don't overlap at all and gradually bump them up; as the balls start to overlap, count the number that overlap together, i.e., that form a continuous region. Do this until all the balls in the set overlap, freeze it, and take that $\epsilon$ as the size of your set.
In terms of your actual question, I think all of these will require you to take a small $\epsilon$ and start bumping it up, looking for n-element sets of size at most $\epsilon$; #1 is probably the easiest to calculate and might give the results you're looking for (see the sketch below), although you can certainly see why #3 is in use for this purpose, too.
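As a rough sketch of #1 (a greedy heuristic, not guaranteed to find the globally densest subset): for each point, form a candidate set from that point plus its n−1 nearest neighbours under whatever metric you choose, compute the candidate's diameter, and keep the candidate with the smallest diameter. This assumes a precomputed pairwise distance matrix D (e.g. from the Fréchet sketch in the question):

```python
import numpy as np

def densest_subset(D, n):
    """Greedy heuristic for n points with a small diameter.

    D: (N, N) symmetric matrix of pairwise distances.
    n: desired subset size (e.g. 30).
    Returns the chosen indices and the diameter of that subset.
    """
    best_diam, best_idx = np.inf, None
    for seed in range(D.shape[0]):
        # the seed itself (distance 0) plus its n-1 nearest neighbours
        idx = np.argsort(D[seed])[:n]
        diam = D[np.ix_(idx, idx)].max()  # diameter = largest pairwise distance
        if diam < best_diam:
            best_diam, best_idx = diam, idx
    return best_idx, best_diam
```

With ~500 points the loop itself is cheap; the expensive part is computing the roughly 125,000 pairwise distances once up front.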
Answered by mim on February 16, 2021
Look into anomaly detection algorithms. Scikit-learn has a variety of built-in ones in its outlier detection module. Finding the 30 most similar points among 500 is the same as finding 470 outliers. Most of these algorithms have a contamination parameter, which specifies what fraction of your dataset you expect to be outliers, so manipulating the value of this parameter can possibly give you what you want.
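A rough sketch with LocalOutlierFactor: note that current scikit-learn versions restrict contamination to at most 0.5, so to keep only 30 of 500 points it's simpler to rank the points by the fitted outlier scores instead. Flattening each 4×300 matrix into a 1200-dimensional vector is just one possible representation, and the random X below is a stand-in for your data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4, 300))        # stand-in for your 500 samples
X_flat = X.reshape(len(X), -1)            # flatten each 4x300 matrix to 1200 dims

lof = LocalOutlierFactor(n_neighbors=30)  # n_neighbors is a tuning choice
lof.fit(X_flat)

# negative_outlier_factor_ is close to -1 for inliers and strongly
# negative for outliers, so larger values mean "more inlier".
scores = lof.negative_outlier_factor_
densest_30 = np.argsort(scores)[-30:]     # indices of the 30 most inlier-like points
```

Euclidean distance on the flattened vectors is a strong assumption for time series; LocalOutlierFactor also accepts metric='precomputed', so you could feed it a pairwise Fréchet distance matrix instead.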
Answered by Hirek Kubica on February 16, 2021