Cross Validated Asked by bk_ on December 13, 2021
I have 40 univariate time series which I am clustering with tslearn.
To determine a reasonable number of clusters, I use the silhouette coefficient. However, I noticed that it is not robust at all, as the maximum occurs at a different number of clusters from run to run.
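For reference, the silhouette coefficient only needs pairwise distances and cluster labels. The following is a minimal pure-Python sketch of the standard definition, not tslearn's implementation; the function name `silhouette_from_distances` and the toy data are illustrative assumptions.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette coefficient from a square distance matrix and labels."""
    # Group sample indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: four points on a line, two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_from_distances(dist, [0, 0, 1, 1])
bad = silhouette_from_distances(dist, [0, 1, 0, 1])
```

With a sensible labeling the score is close to 1, and with a labeling that mixes the two groups it becomes negative.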
I use dynamic time warping (DTW) as the distance measure and apply a min-max transformation to each time series as preprocessing.
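To make the two preprocessing and distance steps concrete, here is a minimal pure-Python sketch of min-max scaling and of the textbook DTW recursion. It is only an illustration: the function names are hypothetical, constant series are mapped to zeros by assumption, and the square root of accumulated squared differences is one common DTW convention rather than a statement about any library's internals.

```python
import math


def minmax_scale_series(values):
    """Rescale one series linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dtw_distance(x, y):
    """Unconstrained DTW via dynamic programming, O(len(x) * len(y))."""
    n, m = len(x), len(y)
    # cost[i][j] = minimal accumulated squared difference aligning x[:i], y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(
                cost[i - 1][j],      # x[i-1] is matched again (y index stays)
                cost[i][j - 1],      # y[j-1] is matched again (x index stays)
                cost[i - 1][j - 1],  # both series advance
            )
    return math.sqrt(cost[n][m])


scaled = minmax_scale_series([1.3, 1.9])
same_shape = dtw_distance([0, 1, 2], [0, 1, 1, 2])  # warping absorbs the repeat
shifted = dtw_distance([0, 0], [1, 1])
```

Because DTW allows one point to be matched to several points of the other series, two series with the same shape but different lengths can have distance zero.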
I cannot share the data, but my df looks like this (just a small excerpt):
time       | value | label
2020-01-01 | 1.3   | 10000
2020-01-02 | 1.9   | 10000
2020-01-01 | 0.5   | 20000
2020-01-02 | 1.2   | 20000
My code:
# imports
from tslearn.clustering import TimeSeriesKMeans, silhouette_score
from sklearn.preprocessing import minmax_scale
import pandas as pd

# get list of time series (one per label), apply min-max transformation to each
ts = []
for ts_label in df["label"].unique():
    ts.append(minmax_scale(df.loc[df["label"] == ts_label, "value"]))

# loop through different numbers of clusters and store the respective silhouette score
min_n, max_n = 2, 9
sil_scores = []
for n in range(min_n, max_n + 1):
    km = TimeSeriesKMeans(n_clusters=n, metric="dtw")
    km.fit(ts)
    sil_scores.append(silhouette_score(ts, km.predict(ts), metric="dtw"))

# prepare resulting df
result_df = pd.DataFrame(data={
    "no_clusters": range(min_n, max_n + 1),
    "silhouette_score": sil_scores,
})
However, if I repeat this process multiple times, I get different results: the highest value of silhouette_score
is at either 2, 3 or 5 clusters (I tried this 11 times and got 2 four times, 3 five times and 5 twice).
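The repetition described above can be made systematic by fixing one seed per run and counting how often each number of clusters wins. The following is a minimal sketch under assumptions: `tally_best_n` and `score_fn` are hypothetical names, and the stand-in scoring function only exists so the snippet runs without the data.

```python
from collections import Counter


def tally_best_n(score_fn, n_values, seeds):
    """Count how often each n maximizes score_fn(n, seed) across the seeds."""
    wins = Counter()
    for seed in seeds:
        scores = {n: score_fn(n, seed) for n in n_values}
        # On ties, max() keeps the first n encountered in n_values.
        best_n = max(scores, key=scores.get)
        wins[best_n] += 1
    return wins


# Hypothetical wiring to the clustering code (requires tslearn and `ts`):
# def score_fn(n, seed):
#     km = TimeSeriesKMeans(n_clusters=n, metric="dtw", random_state=seed)
#     labels = km.fit_predict(ts)
#     return silhouette_score(ts, labels, metric="dtw")


# Stand-in scoring function: the best n depends only on the seed's parity.
def demo_score(n, seed):
    target = 2 if seed % 2 == 0 else 3
    return -abs(n - target)


wins = tally_best_n(demo_score, n_values=range(2, 10), seeds=range(5))
```

Passing the seed explicitly makes every run reproducible, so a spread in the tally can be attributed to the initialization rather than to anything else in the pipeline.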
Is there an error in my code or methodology, or is this a common problem with the silhouette score?
Thanks in advance