
Summary statistics to collect for data that is too large in volume

Cross Validated. Asked by ryu576 on November 9, 2021

I have some data on performance counters for some computers: memory usage, network usage, and so on, emitted every minute, so each counter is a time series. When software updates are deployed to these computers, we suspect the counters may shift higher or lower, so we want to run two-sided hypothesis tests that alert us when this happens.

The volume of this data is very large and it isn't feasible to collect all of it, so we want to aggregate it every 30 minutes or so. Within each aggregation window we can collect the mean, standard deviation, percentiles, and any other statistics (a sketch of such an aggregation appears after the questions below). I have some questions:

  1. How do we design a hypothesis test that uses all of these statistics, each of which forms its own time series? I could use just the mean and standard deviation, but surely using the percentiles as well would give a more powerful test?
  2. How do we decide which statistics to collect to get the most power?
  3. How do we quantify what we lose at different levels of aggregation? I presume we should expect a higher false negative rate (lower power) for a given false positive rate. Any thoughts on how to measure this? Perhaps simulation (see the sketch after this list)?
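
For concreteness, here is a minimal sketch of what the 30-minute aggregation could look like, assuming the raw counters sit in a pandas DataFrame with a datetime index. The counter name and the particular quantiles are made up for illustration; collecting the per-window count as well is cheap and lets later tests weight windows correctly.

    import numpy as np
    import pandas as pd

    def aggregate_counters(raw: pd.DataFrame, window: str = "30min") -> pd.DataFrame:
        """Collapse per-minute counter readings into per-window summary statistics.

        `raw` is assumed to have a DatetimeIndex and one column per counter
        (e.g. 'memory_mb'); the column name is illustrative only.
        """
        def q(p):
            # named quantile aggregator so the output columns are labelled p50, p95, ...
            f = lambda x: x.quantile(p)
            f.__name__ = f"p{int(p * 100)}"
            return f

        return raw.resample(window).agg(["mean", "std", "count", q(0.5), q(0.95), q(0.99)])

    # Example: 2 hours of fake per-minute memory readings
    idx = pd.date_range("2021-11-09", periods=120, freq="min")
    raw = pd.DataFrame(
        {"memory_mb": np.random.default_rng(0).normal(800, 50, size=120)}, index=idx
    )
    print(aggregate_counters(raw))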
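
On question 3, a Monte Carlo sketch of how the power loss could be measured: simulate per-minute data before and after a hypothetical update with a small mean shift, run a two-sided Welch t-test once on the raw per-minute samples and once on the 30-minute window means, and count how often each rejects at a fixed significance level. All distributional choices here (i.i.d. normal noise, shift sizes, window length) are assumptions for illustration, not properties of the real counters.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    def simulate_power(shift, n_minutes=24 * 60, window=30, alpha=0.05, n_sims=2000):
        """Estimate rejection rates of a two-sided Welch t-test on raw per-minute
        data vs. on 30-minute window means, for a given mean shift after the update."""
        reject_raw = reject_agg = 0
        for _ in range(n_sims):
            before = rng.normal(0.0, 1.0, size=n_minutes)
            after = rng.normal(shift, 1.0, size=n_minutes)

            # Test on the full per-minute samples
            _, p_raw = stats.ttest_ind(before, after, equal_var=False)
            reject_raw += p_raw < alpha

            # Test on the per-window means only (what survives aggregation)
            before_means = before.reshape(-1, window).mean(axis=1)
            after_means = after.reshape(-1, window).mean(axis=1)
            _, p_agg = stats.ttest_ind(before_means, after_means, equal_var=False)
            reject_agg += p_agg < alpha

        return reject_raw / n_sims, reject_agg / n_sims

    for shift in (0.0, 0.05, 0.1):  # shift = 0 checks the false positive rate
        raw_power, agg_power = simulate_power(shift)
        print(f"shift={shift:.2f}  raw={raw_power:.3f}  aggregated={agg_power:.3f}")

The same harness can be rerun with other window lengths, other test statistics (e.g. ones that also use the window standard deviations or percentiles), or noise models closer to the real counters, to see how much power each choice gives up at a fixed false positive rate.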
