Cross Validated Asked by Mehraban on December 18, 2021
DBSCAN is most cited clustering algorithm according to some literature and it can find arbitrary shape clusters based on density. It has two parameters eps (as neighborhood radius) and minPts (as minimum neighbors to consider a point as core point) which I believe it highly depends on them.
Is there any routine or commonly used method to choose these parameters?
Maybe a bit late, but I would like to add an answer here for future knowledge.
One way to find the best $epsilon$ for DBSCAN is to compute the knn
, then sort the distances and see where the "knee" is located.
Example in python, because is the language I manage.:
from sklearn.neighbors import NearestNeighbors
import plotly.express as px
neighbors = 6
# X_embedded is your data
nbrs = NearestNeighbors(n_neighbors=neighbors ).fit(X_embedded)
distances, indices = nbrs.kneighbors(X_embedded)
distance_desc = sorted(distances[:,ns-1], reverse=True)
px.line(x=list(range(1,len(distance_desc )+1)),y= distance_desc )
Then, to find the "knee", you can use another package:
(pip install kneed
)
from kneed import KneeLocator
kneedle = KneeLocator(range(1,len(distanceDec)+1), #x values
distanceDec, # y values
S=1.0, #parameter suggested from paper
curve="convex", #parameter from figure
direction="decreasing") #parameter from figure
To see where the "knee" is, you can run
kneedle.plot_knee_normalized()
the commands kneedle.elbow
or kneedle.knee
returns the index of the x
array, and the kneedle.knee_y
returns the optimum value for $epsilon$.
Answered by Luis Felipe on December 18, 2021
minPts is selected based on the domain knowledge. If you do not have domain understanding, a rule of thumb is to derive minPts from the number of dimensions D in the data set. minPts >= D + 1. For 2D data, take minPts = 4. For larger datasets, with much noise, it suggested to go with minPts = 2 * D.
Once you have the appropriate minPts, in order to determine the optimal eps, follow these steps -
Let's say minPts = 24
Answered by Pratik Nabriya on December 18, 2021
There are plenty of publications that propose methods to choose these parameters.
The most notable is OPTICS, a DBSCAN variation that does away with the epsilon parameter; it produces a hierarchical result that can roughly be seen as "running DBSCAN with every possible epsilon".
For minPts, I do suggest to not rely on an automatic method, but on your domain knowledge.
A good clustering algorithm has parameters, that allow you to customize it to your needs.
A parameter that you overlooked is the distance function. The first thing to do for DBSCAN is to find a good distance function for your application. Do not rely on Euclidean distance being the best for every application!
Answered by Has QUIT--Anony-Mousse on December 18, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP