Data Science Asked on April 6, 2021
I have a regular 2D grid of data points (X, Y) with each point having a value.
I’d like to identify clusters and then anomalies that don’t belong to those clusters.
I’m trying to understand the best way to do this and it looks like DBSCAN might be an appropriate algorithm to use but I’m a bit confused.
I’m not clear whether I should collapse the 2D (X,Y) data down into 1 dimension and run DBSCAN on the 1D array. That would lose the spatial component of the data but would then identify clusters based on the value of the data point.
Or do I run DBSCAN on a 3D array (the X index, the Y index and its corresponding data value for that X,Y point)? That sounds intuitively like it might be the better way to approach this but I’m not 100% sure.
If anyone could point me towards an example of using DBSCAN for spatial clustering that’d be really helpful.
Clustering in 3D is a good way to go, but be careful with feature scaling. Presumably X and Y are on the same scale, so unless you want to treat the two directions differently, make sure not to apply any normalization to them, as that would distort the grid.
Your Value column, on the other hand, might be on a very different scale from the X, Y values. If this is not corrected, it creates an implicit and usually quite arbitrary feature weighting, which is rarely desirable. If you want to weight the Value column equally to X and Y, scale the values using either the mean and standard deviation or the min and max of the X/Y columns.
If you do want to weight the individual attributes differently, you can additionally introduce weighting factors that scale them.
PS: remember that DBSCAN hyperparameters, such as epsilon, are given in the same units as your feature space.
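Here is a minimal sketch of the 3D approach with scikit-learn. The synthetic grid, the injected anomaly, the `value_weight` factor and the `eps` / `min_samples` settings are all assumptions made for illustration, not recommendations for your data. X and Y are left untouched and only the value column is rescaled to the spread of the coordinates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 20x20 grid (assumed data, for illustration only).
rng = np.random.default_rng(0)
ny, nx = 20, 20
values = rng.normal(loc=100.0, scale=1.0, size=(ny, nx))
values[:, 10:] += 50.0   # right half of the grid sits at a different level
values[5, 5] = 500.0     # one injected anomaly at row 5, column 5

# One (x, y, value) row per grid point, using the grid indices as coordinates.
yy, xx = np.mgrid[0:ny, 0:nx]
x = xx.ravel().astype(float)
y = yy.ravel().astype(float)
v = values.ravel()

# Keep x and y as they are; bring the value column onto the same spread.
xy_std = np.concatenate([x, y]).std()
value_weight = 1.0  # hypothetical knob: >1 emphasises value, <1 emphasises position
v_scaled = (v - v.mean()) / v.std() * xy_std * value_weight

features = np.column_stack([x, y, v_scaled])

# eps lives in feature-space units: here roughly "grid steps".
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(features)

# DBSCAN marks points that belong to no cluster with the label -1.
anomalies = np.flatnonzero(labels == -1)
n_clusters = len(set(labels.tolist()) - {-1})
print("clusters:", n_clusters, "anomalous points:", anomalies.size)
```

Because `eps` is measured in the scaled feature space, changing `value_weight` or the scaling method means `eps` has to be revisited as well.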
Often an interesting question is "is this value anomalous compared to the normal for this location?". This can be done on the X and Y axes individually, but often there are interactions between X and Y, and one should consider the two dimensions together. Here are some approaches for that:
Divide the X and Y axes into a grid, typically with uniformly sized cells. Assign each grid cell a location identifier, then use this identifier as a feature.
This approach is simple, and with a lot of dense data it usually works reasonably well. But the grid can fit the data poorly: data points very close to each other can end up in different cells, and are then quite distant in feature space.
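The grid-cell idea could look something like the sketch below. The helper names `grid_cell_ids` and `per_cell_zscore` are hypothetical, and scoring each value against the median and MAD of its own cell is just one assumed way of using the identifier, not the only one.

```python
import numpy as np

def grid_cell_ids(x, y, cell_size, x0=0.0, y0=0.0):
    """Map each (x, y) point to an integer id of the uniform cell containing it."""
    col = np.floor((x - x0) / cell_size).astype(int)
    row = np.floor((y - y0) / cell_size).astype(int)
    n_cols = col.max() + 1
    return row * n_cols + col

def per_cell_zscore(values, cell_ids):
    """Robust z-score of each value relative to the other values in its cell."""
    scores = np.zeros(len(values), dtype=float)
    for cid in np.unique(cell_ids):
        mask = cell_ids == cid
        med = np.median(values[mask])
        mad = np.median(np.abs(values[mask] - med))
        # 1.4826 * MAD approximates the standard deviation for normal data.
        scores[mask] = (values[mask] - med) / (1.4826 * mad + 1e-12)
    return scores

# Assumed example data: the "normal" value differs between left and right halves.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = rng.uniform(0, 10, 500)
values = np.where(x < 5, 10.0, 20.0) + rng.normal(0, 0.5, 500)
values[0] += 15.0  # injected anomaly

cell_ids = grid_cell_ids(x, y, cell_size=5.0)
scores = per_cell_zscore(values, cell_ids)
print("most anomalous point index:", int(np.argmax(np.abs(scores))))
```

The cell size is the main thing to tune: too large and the "normal for this location" gets blurred, too small and each cell has too few points to estimate it.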
An alternative location scheme is to use clustering. Cluster the points based only on X and Y, and then use the cluster number as the location identifier. You may also want to include the distance from the cluster center as a feature, since some points might fall pretty far from a cluster. The angle from the cluster center can in some cases also be informative.
This has a much better shot at creating good location features for the data, but it is more complex. One should also consider when to re-cluster.
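A rough sketch of the clustering-based location features, assuming scikit-learn's `KMeans` as the clustering method (the text above does not prescribe one) and made-up point data with three blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed example data: irregularly placed points around three centers.
rng = np.random.default_rng(2)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
xy = np.vstack([c + rng.normal(0, 1.0, size=(100, 2)) for c in true_centers])

# Cluster on position only; the value column is deliberately left out here.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(xy)
location_id = km.labels_

# Offset of each point from the center of its assigned cluster.
offset = xy - km.cluster_centers_[location_id]
dist_to_center = np.linalg.norm(offset, axis=1)
angle = np.arctan2(offset[:, 1], offset[:, 0])  # radians in [-pi, pi]

location_features = np.column_stack([location_id, dist_to_center, angle])
print(location_features[:3])
```

New points can be assigned to the existing clusters with `km.predict(new_xy)`, which avoids re-clustering until the spatial layout of the data actually changes.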
Aside: Make sure to plot your data and look at it! For example, use a 2D plot with X, Y as coordinates and the values shown as color. You may also want to make the same kind of plot with the anomaly score as the color. If you upload a plot here, it will also be much easier to give good advice.
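For the plotting, something along these lines with matplotlib should be enough to get started. The data and the `anomaly_score` here are placeholders (absolute deviation from the median) standing in for whatever score your method produces.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np

# Assumed example grid with one injected anomaly.
rng = np.random.default_rng(3)
yy, xx = np.mgrid[0:20, 0:20]
values = rng.normal(100.0, 1.0, size=xx.shape)
values[5, 5] = 120.0

# Placeholder score only; substitute the output of your own anomaly detector.
anomaly_score = np.abs(values - np.median(values))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [(values, "Value"), (anomaly_score, "Anomaly score")]
for ax, (data, title) in zip(axes, panels):
    sc = ax.scatter(xx.ravel(), yy.ravel(), c=data.ravel(), cmap="viridis", s=20)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_title(title)
    ax.set_aspect("equal")  # keep the grid undistorted
    fig.colorbar(sc, ax=ax)

# Call fig.savefig("some_name.png") or plt.show() to view the result.
```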
Correct answer by Jon Nordby on April 6, 2021
Check out this comparison here. Intuitively I would say fit on everything you've got, don't throw anything away.
Answered by N. Kiefer on April 6, 2021