Data Science Asked on July 23, 2021
I have residuals of a multivariate time series obtained from sensors on a server. Spikes in the residual plots indicate an abnormal server state. I want to cluster the data into vertical clusters and get the indices of the points in each cluster, so that I can go back to the actual data and compute the mean values of the various parameters within a cluster.
I tried 1-D Gaussian and K-means clustering, among others, but they all seem to cluster the data horizontally.
I want each spike to be a separate cluster and to get the indices of the values in that cluster. Can anyone suggest a technique to overcome this issue?
Thanks.
You might find more useful methods if you look for event detection or outlier detection rather than clustering. Given the spiky shape of the data, you could try using the mean value of the time series as a threshold: contiguous chunks of values above the global mean will then be your abnormal states. This method might be a bit fragile in the long run, so you might want to make it more sophisticated. For example, you could use the mean over a large rolling window rather than the global mean if you are more interested in local changes.
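A minimal sketch of this thresholding idea (the function name find_spike_segments and the window parameter are just illustrative, not an established API):

import numpy as np

def find_spike_segments(residuals, window=None):
    """Return (start, end) index pairs of contiguous runs above a threshold.

    If window is None the global mean is the threshold; otherwise a rolling
    mean over `window` samples is used, so the threshold adapts locally.
    """
    residuals = np.asarray(residuals, dtype=float)
    if window is None:
        threshold = residuals.mean()                             # global mean
    else:
        kernel = np.ones(window) / window
        threshold = np.convolve(residuals, kernel, mode="same")  # rolling mean

    above = residuals > threshold            # boolean mask of "abnormal" samples
    edges = np.diff(above.astype(int))       # +1 where a spike starts, -1 where it ends
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, len(residuals)]
    return list(zip(starts, ends))           # end indices are exclusive

Each (start, end) pair is one spike; slicing the original sensor data with those indices lets you compute the per-cluster means the question asks for.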
If you wanted to keep things in a clustering framing, it might work better to cluster in 2D, i.e. make your points (time, value) tuples/vectors, rather than clustering the 1-D data. Another approach that would probably work is K-means on the 1-D data with only 2 clusters (normal and abnormal); you would then have to use the time information to separate individual events/anomalies.
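A rough sketch of the two-cluster K-means variant, assuming scikit-learn is available and using synthetic residuals in place of your data:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for your residuals, with two artificial spikes.
rng = np.random.default_rng(0)
residuals = rng.normal(0, 0.2, 300)
residuals[50:55] += 3.0
residuals[200:210] += 4.0

# Two clusters: "normal" vs "abnormal". KMeans expects 2-D input.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    residuals.reshape(-1, 1)
)
# The cluster with the larger mean residual is the abnormal one.
abnormal = labels == np.argmax([residuals[labels == k].mean() for k in (0, 1)])

# Use the time information to split abnormal points into separate events:
# indices more than one step apart belong to different spikes.
idx = np.flatnonzero(abnormal)
events = np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)
for i, ev in enumerate(events):
    print(f"event {i}: indices {ev[0]}..{ev[-1]}, mean residual {residuals[ev].mean():.2f}")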
Answered by omega1563 on July 23, 2021
Assuming that what you mean by clusters are the colors in your graph, those are basically percentiles, i.e. how many points lie above or below a certain percentage of the data.
To find the points within a vertical range you just need the points between two percentiles. For example, with numpy one can do:
import numpy as np

# Example series; the percentile boundaries split the values into vertical bands.
a = np.array([1, 1, 1, 5, 5, 1, 1, 6, 7, 8, 10, 13, 10, 9, 7, 2, 1, 1, 5, 6, 9, 9, 6, 1, 1, 1])
a1 = a[(a >= np.percentile(a, 0)) & (a <= np.percentile(a, 25))]
a2 = a[(a > np.percentile(a, 25)) & (a <= np.percentile(a, 50))]
a3 = a[(a > np.percentile(a, 50)) & (a <= np.percentile(a, 65))]
a4 = a[(a > np.percentile(a, 65)) & (a <= np.percentile(a, 80))]
a5 = a[(a > np.percentile(a, 80)) & (a <= np.percentile(a, 90))]
a6 = a[(a > np.percentile(a, 90)) & (a <= np.percentile(a, 100))]
This gives the points as follows:
a1 contains the points from the 0th percentile (inclusive) to the 25th percentile (inclusive),
a2 contains the points from the 25th (exclusive) to the 50th (inclusive),
a3 contains the points from the 50th (exclusive) to the 65th (inclusive), and so on.
You need to be careful with the comparisons (greater than vs. greater than or equal) to include all points: one of the ranges needs to be inclusive on both sides.
We can plot this too, to see how it looks:
import matplotlib.pyplot as plt

# Recover the x positions of each band by matching its values back to a.
x = np.arange(len(a))
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(x[np.isin(a, a1)], a1, 'o')
ax.plot(x[np.isin(a, a2)], a2, 'o')
ax.plot(x[np.isin(a, a3)], a3, 'o')
ax.plot(x[np.isin(a, a4)], a4, 'o')
ax.plot(x[np.isin(a, a5)], a5, 'o')
ax.plot(x[np.isin(a, a6)], a6, 'o')
plt.show()
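Since the question asks for the indices of the points in each cluster, the same np.isin masks used for plotting give them directly (this works here because the percentile bands are disjoint, so each value matches exactly one band), e.g. for a4:

idx_a4 = np.flatnonzero(np.isin(a, a4))   # time indices whose values fall in a4
print(idx_a4)                             # positions in the original series
print(a[idx_a4])                          # the corresponding values, same as a4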
Percentiles are statistical measures, not fixed cut-off values. I used quite irregular percentiles here because I wanted to (1) get nice colors on the graph and (2) have at least one empty range (note that a5 is empty, since there are no points between the 80th and 90th percentiles). In a real scenario one would use regular ranges, e.g. (0, 25, 50, 75, 100) or (0, 10, 20, 30, 40, 50, 60, 70, 80, 90).
Answered by grochmal on July 23, 2021