# Selecting uncorrelated samples from a set of bulk data that contains correlated and dependent samples

Cross Validated Asked by Sarmes on January 5, 2022

I have a data set generated by expensive computational model evaluations: 10000 samples in 40 dimensions in total. It is composed of several subsets, originating partly from random runs, Latin hypercube DOE, radial design DOE, and linear parameter studies; a large part is history data generated by several optimization runs using genetic algorithms.

My thought was that a large part of the function evaluations generated during the genetic algorithm runs could somehow be used to augment the set of random and Latin hypercube samples, giving a larger sample set for a variance-based sensitivity analysis.

I came up with two ideas, but I am an engineer, not a mathematician:

1) Using the covariance matrix of the total sample matrix, filter out samples until the off-diagonal terms are smaller than some threshold, to avoid correlations.

2) The other idea was to apply some sort of minimum distance filter to avoid areas with tightly clustered samples.
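As an illustration only, the second idea could be sketched as a greedy minimum-distance filter, with the off-diagonal correlation from the first idea used as a before/after diagnostic. The function names `min_distance_filter` and `max_abs_offdiag_corr`, the range scaling of each column, the threshold `d_min=0.1`, and the synthetic data are all assumptions made for this sketch, not part of the original problem.

```python
import numpy as np

def max_abs_offdiag_corr(X):
    """Largest absolute off-diagonal entry of the column correlation matrix."""
    C = np.corrcoef(X, rowvar=False)
    return np.abs(C - np.eye(C.shape[0])).max()

def min_distance_filter(X, d_min):
    """Greedily keep rows that are at least d_min away from every kept row.

    Distances are Euclidean after scaling each column to [0, 1] (an assumed
    choice, so that no single dimension dominates). Expects a non-empty X.
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0              # guard against constant columns
    Z = (X - lo) / span
    kept = [0]                         # the first row is always kept
    for i in range(1, len(Z)):
        # Distance from candidate row i to every row kept so far
        if np.linalg.norm(Z[kept] - Z[i], axis=1).min() >= d_min:
            kept.append(i)
    return np.array(kept)

# Synthetic stand-in: 200 space-filling points plus 300 points piled up
# near one corner, mimicking an optimizer converging on an optimum.
rng = np.random.default_rng(0)
space_filling = rng.random((200, 5))
clustered = 0.9 + 0.001 * rng.standard_normal((300, 5))
X = np.vstack([space_filling, clustered])

kept = min_distance_filter(X, d_min=0.1)
print(len(X), "->", len(kept), "samples")
print("max |corr| before:", max_abs_offdiag_corr(X))
print("max |corr| after: ", max_abs_offdiag_corr(X[kept]))
```

The greedy pass depends on row order: rows that come first are preferred, so placing the random and Latin hypercube samples ahead of the optimizer history keeps the designed samples and only adds optimizer points that fill gaps.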

Would that be sufficient? Are there any tests for randomness that I could use?
The problem is that I don’t know the right terminology, so maybe ready-to-run methods exist for such problems, but I don’t know how to find them because I don’t know their names.

I am thankful for any helpful suggestions.

Have you thought about orthogonalizing the entire data matrix with PCA? You could replace the columns of $\mathbf{X}$ with the uncorrelated principal components (eigenvectors normalized to their $\sqrt{\lambda_m}$).
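A minimal sketch of this PCA step, under one reading of the normalization (principal component scores divided by $\sqrt{\lambda_m}$, so each component has unit variance), might look like the following. The function name `pca_whiten`, the mixing matrix `A`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pca_whiten(X):
    """Return PCA scores, unit-variance (whitened) scores, and eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each column
    cov = np.cov(Xc, rowvar=False)           # covariance between columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # uncorrelated columns
    whitened = scores / np.sqrt(eigvals)     # assumes all eigenvalues > 0
    return scores, whitened, eigvals

# Synthetic stand-in: 4 correlated columns built by mixing independent noise.
rng = np.random.default_rng(1)
A = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.0, 1.0, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.3],
              [0.0, 0.0, 0.0, 1.0]])
X = rng.standard_normal((500, 4)) @ A

scores, whitened, eigvals = pca_whiten(X)
print(np.round(np.cov(whitened, rowvar=False), 6))  # close to the identity
```

Note that this removes linear correlation between the columns (the variables); it does not by itself address dependence between the rows (the samples), such as clustering of optimizer iterates.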

It sounds like you also don't have any categorical grouping variables among the 40 variables. In that case, the only thing you are left with is measuring the association between variables. If you are trying to do both linear and non-linear assessments for sensitivity analysis and variance explanation, then break up the data using a "divide and conquer" approach, solving a large problem by solving smaller problems. A mixture of variables generated from DOE, LHS, and genetic algorithms sounds quite complex, but as long as you pose questions one at a time and then do the associated analysis to answer each one, you can work through your analytic goals.

By the way, there are no variance explanation approaches that allow you to pull out non-linear and linear components using the same model, unless you code it yourself using non-linear regression and linear regression. There are packages that allow you to fit data based on equations, so maybe look at those (IGOR, EGRET, AMFIT (Poisson), MATLAB, etc.).

Last, be careful of the "so what?" question, whereby after you have done all of your model checking, a reader could ask why you did all of this on simulated data.

Answered by user32398 on January 5, 2022