How to find vertical clusters in 1-D data

Question

I have residuals of a multivariate time series data obtained from sensors on a server.spikes in the plots of residuals indicate abnormal server state. I want to cluster the data into vertical clusters and get the indices of the points in each cluster so that I can go back and look at the actual data and get the mean values of various parameters of a cluster.

I tried 1d Gaussian ,Kmeans etc but they all seem to cluster the data horizontally.

I want each spike to be a separate cluster and get the indices of the values in the cluster. Can any one suggest a technique to overcome this issue.
Thanks.

omega1563 · Answer

You might find some more useful methods if you look for event detection or outlier detection, rather than clustering. Given the shape of the spikiness in the data, you could try using the mean value of the time series as a threshold, then contiguous chunks of values that are greater than the global mean will be your abnormal states. Now this method might be a little bit fragile in the long run, so you might want to make it more sophisticated. You could use the mean value over a large rolling window, rather than the global mean, if you are more interested in local changes.

If you wanted to keep things in a clustering framing, then it might work better if you did the clustering in 2D, i.e. your points will be (time, value) tuples/vectors, rather than doing the clustering on the 1D data. Another thing that would probably work would be to use K-means clustering on the 1D data, but only have 2 clusters (normal and abnormal) you would then have to use time information to separate individual events/anomalies.

grochmal · Answer

Assuming that what you mean by clusters are the colors in your graph, those are basically percentiles. i.e. how many points are above or below a certain percent of data.

To find the points withing a vertical range you just need the points between two percentiles. For example with numpy one can do:

a = np.array([1, 1, 1, 5, 5, 1, 1, 6, 7, 8, 10, 13, 10, 9, 7, 2, 1, 1, 5, 6, 9, 9, 6, 1, 1, 1])
a1 = a[(a >= np.percentile(a, 0)) & (a <= np.percentile(a, 25))]
a2 = a[(a > np.percentile(a, 25)) & (a <= np.percentile(a, 50))]
a3 = a[(a > np.percentile(a, 50)) & (a <= np.percentile(a, 65))]
a4 = a[(a > np.percentile(a, 65)) & (a <= np.percentile(a, 80))]
a5 = a[(a > np.percentile(a, 80)) & (a <= np.percentile(a, 90))]
a6 = a[(a > np.percentile(a, 90)) & (a <= np.percentile(a, 100))]

This gives the points as follows:

a1 contains points within the 0th percentile (inclusive) to 25th percentile (inclusive)
a2 contains the points within the 25th (exclusive) to 50th (inclusive)
a3 contains the points within 50th (exclusive) to 65th (exclusive)
and so on

You need to be careful with the comparison (greater than vs. greater than or equal) to include in all points: one of the ranges will need to be inclusive on both sides.

We can plot this too, to see how it looks:

x = np.arange(len(a))
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(x[np.isin(a, a1)], a1, 'o')
ax.plot(x[np.isin(a, a2)], a2, 'o')
ax.plot(x[np.isin(a, a3)], a3, 'o')
ax.plot(x[np.isin(a, a4)], a4, 'o')
ax.plot(x[np.isin(a, a5)], a5, 'o')

Percentiles are statistical measures by default. They are not fixed numbers. I used very varied percentiles because I wanted to (1) get nice colors on the graph and (2) have at least on empty range (note that a5 is empty since there are no points between 80th and 90th percentiles). In a real scenario one would take very regular ranges, e.g. (0, 25, 50, 75, 100) or (0, 10, 20, 30, 40, 50, 60, 70, 80, 90).

How to find vertical clusters in 1-D data

2 Answers

Add your own answers!

Ask a Question