Data Science Asked by PascalVKooten on August 17, 2021
I was wondering whether anyone has considered a sampling technique that aims to keep as much of the variance as possible (e.g., as many unique values as possible, or very widely distributed continuous variables).
The benefit would be that you could develop code around the sample while still exercising the edge cases in the data.
You could then always take a representative sample later.
So, has anyone tried sampling for maximum variance before, and is there a clever way to sample with as high a variance as possible (an approximation is perfectly fine)?
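For concreteness, here is a rough sketch of what such a "maximum-variance" sampler might look like. This is an illustrative heuristic under my own assumptions (the function name and the greedy farthest-from-the-mean criterion are not an established method), not something from the question itself:

```python
import numpy as np

def greedy_max_variance_sample(X, n_samples, random_state=0):
    """Greedily pick rows of X that spread the sample out as much as possible.

    Illustrative heuristic: seed with a random row, then repeatedly add the
    row farthest from the current sample mean, as a rough proxy for adding
    variance and capturing edge cases.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    chosen = [int(rng.integers(len(X)))]        # seed with a random row index
    remaining = set(range(len(X))) - set(chosen)

    while len(chosen) < n_samples and remaining:
        center = X[chosen].mean(axis=0)          # mean of the sample so far
        rest = np.fromiter(remaining, dtype=int)
        dists = np.linalg.norm(X[rest] - center, axis=1)
        best = int(rest[np.argmax(dists)])       # farthest point from current mean
        chosen.append(best)
        remaining.remove(best)

    return X[np.array(chosen)]

# Example: pick 10 widely spread rows out of 1000 random points
X = np.random.default_rng(42).normal(size=(1000, 5))
sample = greedy_max_variance_sample(X, n_samples=10)
print(sample.shape)  # (10, 5)
```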
It depends on what you mean by sampling. Is it sampling between or within features?
For between features, scikit-learn has a built-in option, VarianceThreshold, which removes features whose variance does not meet a given threshold.
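A minimal sketch of that option; the threshold value and toy data here are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2.0, 0.1],
              [0, 1.0, 0.2],
              [0, 3.0, 0.1],
              [0, 2.5, 0.2]])   # first column is constant (zero variance)

# Drop features (columns) whose variance falls below the threshold
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # constant and near-constant columns are removed
```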
Answered by Brian Spiering on August 17, 2021