
How to optimize sampling for parameter estimation

Computational Science. Asked by JN_ on January 19, 2021

I have a computer model with a number of parameters that need to be calibrated against experimental results. It is also important to understand the sensitivity of the results to each parameter individually. The intuitive approach to this problem is to sample each parameter (say, m realizations per parameter) and conduct a factorial experiment to estimate the main effects and the interactions between parameters. As is repeatedly reported, the problem with this approach is its computational cost: for a model with 10 parameters, even a two-level design requires $2^{10} = 1024$ runs, which is not feasible for my model.
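For concreteness, here is a minimal sketch (in Python, purely illustrative) of how quickly a two-level full factorial grows; each row of the design would correspond to one expensive model run:

```python
from itertools import product

n_params = 10
levels = (-1, +1)  # coded low/high value for each parameter

# Two-level full factorial: every combination of levels across all parameters.
design = list(product(levels, repeat=n_params))
print(len(design))  # 2**10 = 1024 -- one (expensive) model run per row
```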

I am therefore searching for an alternative sampling approach that rescues me from this headache. Any guidance is appreciated.

2 Answers

The computational cost of sensitivity analysis for a reasonably large simulation stems from two main issues. First, as mentioned in the body of the question, the large number of parameters means that even a small number of samples per parameter leads to an astronomical number of runs. Second, a very coarse sampling (a two-level design) is hardly enough to even roughly understand the behavior of a system with respect to its parameters.

After spending a couple of days on this, here is what I understood. Two steps are usually taken to address these concerns. First, instead of a full factorial design, one can use a fractional factorial design to reduce the number of runs, which addresses the first concern. This approach assumes that much of the information provided by the full factorial design is redundant and can be ignored; see the Wikipedia article on fractional factorial designs. Using a two-level fractional factorial design, one can conduct a preliminary screening study and drop the parameters that turn out to be insignificant. This leaves a much smaller parameter set that can be investigated more extensively with other methods (mentioned below). For completeness, there are other designs in this scope that reduce the complexity of a full factorial, such as Plackett-Burman designs, Cotter designs, and mixed-level designs.
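As a hand-rolled illustration of the idea (not tied to any particular DOE package; the factor labels are placeholders), a resolution-III $2^{5-2}$ design takes a full factorial in three base factors and appends two generator columns:

```python
import numpy as np
from itertools import product

# Full 2^3 factorial in the base factors a, b, c (coded -1/+1): 8 runs.
base = np.array(list(product((-1, 1), repeat=3)))
a, b, c = base[:, 0], base[:, 1], base[:, 2]

# Generators for a 2^(5-2) fractional factorial: d = a*b, e = a*c.
# Main effects of d and e are aliased with two-factor interactions;
# that aliasing is the "redundant information" traded away for
# running 8 experiments instead of 32.
design = np.column_stack([a, b, c, a * b, a * c])
print(design.shape)  # (8, 5): 8 runs for 5 two-level factors
```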

In the second step, to understand the system further, one can use space-filling designs (e.g., Latin hypercube sampling, LHS) to investigate each parameter's influence on the system in more detail. Different methods use different sampling strategies depending on the nature of the problem. I found the open-access article "Sensitivity analysis by design of experiments" by Van Schepdael et al. (2016) useful in this regard.
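For example, here is a minimal Latin hypercube sketch using SciPy's `scipy.stats.qmc` module (available in SciPy 1.7+); the parameter bounds below are made-up placeholders:

```python
import numpy as np
from scipy.stats import qmc

n_params = 10
n_runs = 50  # far fewer than a full factorial, yet space-filling

sampler = qmc.LatinHypercube(d=n_params, seed=0)
unit_sample = sampler.random(n=n_runs)   # points in the unit hypercube [0, 1]^d

# Rescale to (hypothetical) physical parameter ranges.
lower = np.zeros(n_params)
upper = np.ones(n_params) * 10.0
design = qmc.scale(unit_sample, lower, upper)
print(design.shape)  # (50, 10): one row per model run
```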

One more point. If a single run of your model takes a substantial amount of time and executing thousands of runs with the aforementioned methods is not feasible, check out Approximate Approximate Bayesian Computation (AABC). This method fits a statistical model to a limited set of results from your real model and then performs the more extensive runs on the statistical model instead. I think the main article describing this method is "AABC: Approximate approximate Bayesian computation for inference in population-genetic models" by Buzbas et al. (2015).
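To illustrate the underlying surrogate idea only (this is not the AABC algorithm itself), here is a hedged sketch that fits a Gaussian-process emulator to a small budget of real runs and then samples the emulator cheaply; `expensive_model` is a toy stand-in for the real simulation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def expensive_model(x):
    """Toy stand-in for the real simulation (one scalar output)."""
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

# Small budget of real model runs ...
X_train = rng.uniform(-2, 2, size=(30, 2))
y_train = expensive_model(X_train)

# ... used to fit a cheap statistical emulator.
emulator = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
emulator.fit(X_train, y_train)

# Extensive sampling now happens on the emulator, not the simulation.
X_big = rng.uniform(-2, 2, size=(10_000, 2))
y_pred, y_std = emulator.predict(X_big, return_std=True)
print(y_pred.shape, y_std.mean())
```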

Please correct me or let me know your opinions in the comments.

Answered by JN_ on January 19, 2021

What you have just described is the curse of dimensionality: the number of samples scales exponentially with the dimensionality $d$. Suppose you want $s$ samples in each dimension; then you need $s^d$ runs to explore your design space. Set $s=2$ and $d=10$ as in your example, and you get $2^{10} = 1024$ runs.

In the literature this is usually referred to as a full tensor grid. There is an approach called sparse grids, which significantly reduces the number of samples while retaining the desired accuracy. Essentially, you want an accurate interpolation/integration method in high-dimensional spaces. See:

  1. Barthelmann, V., Novak, E. and Ritter, K. (2000) High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math. 12, 273–288.
  2. Novak, E. and Ritter, K. (1999) Simple cubature formulas with high polynomial exactness. Constructive Approx. 15, 499–522.

Open-source implementations are available; a notable one is DAKOTA from Sandia National Laboratories, which you can download from http://dakota.sandia.gov/.
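If you prefer to stay in Python, the sketch below contrasts a full tensor grid with a Smolyak sparse grid; it assumes the chaospy library's `generate_quadrature` interface (Tasmanian and DAKOTA expose similar functionality), so treat it as a rough template rather than a definitive recipe:

```python
import chaospy

dim = 5
dist = chaospy.Iid(chaospy.Uniform(-1, 1), dim)

# Full tensor grid: the point count grows exponentially with dimension.
nodes_full, _ = chaospy.generate_quadrature(3, dist, rule="clenshaw_curtis")

# Smolyak sparse grid of the same nominal order: far fewer points.
nodes_sparse, _ = chaospy.generate_quadrature(3, dist, rule="clenshaw_curtis",
                                              sparse=True)
print(nodes_full.shape[1], nodes_sparse.shape[1])
```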

If your dimensionality is very high ($d > 1{,}000$ or so), you are better off using Monte Carlo, since its convergence rate is independent of the dimensionality (it depends only on the number of samples). The downside is that you have to sample a lot, $\mathcal{O}(10^5)$ to $\mathcal{O}(10^6)$ points or so, to achieve good accuracy.
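A small sketch of that statement: with a test integrand normalized to unit variance (an arbitrary placeholder, chosen so only the sample count matters), the Monte Carlo error tracks $1/\sqrt{N}$ whether $d$ is 10 or 1000:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Test integrand with zero mean and unit variance in every dimension,
    so only the sample count n controls the Monte Carlo error."""
    d = x.shape[1]
    return np.sqrt(12.0 / d) * np.sum(x - 0.5, axis=1)

for d in (10, 100, 1000):
    for n in (10**2, 10**4):
        estimate = f(rng.random((n, d))).mean()   # true mean is 0
        print(f"d={d:5d}  n={n:6d}  |error|={abs(estimate):.2e}  "
              f"~1/sqrt(n)={1/np.sqrt(n):.2e}")
```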

Answered by kensaii on January 19, 2021
