Asked by KevinKim on July 30, 2021
Assume $D$ is the training data set with both the value of the predictors $\mathbf{X}$ and the value of the response variable $Y$. I have a loss function $L$ and two models $f(\mathbf{X};\beta)$ and $g(\mathbf{X};\lambda)$, where $\beta$ and $\lambda$ are model parameters. Our goal is to estimate
\begin{equation}
e(f)=\mathbb{E}[L(Y,f(\mathbf{X};\beta))\mid D] \quad \text{and} \quad e(g)=\mathbb{E}[L(Y,g(\mathbf{X};\lambda))\mid D].
\end{equation}
Note that $e(f)$ is the generalization error of the model $f$ (similarly $e(g)$ for $g$) trained on the specific training data set $D$, where the expectation is taken over a new observation drawn from the same distribution that generated $D$.
Now consider a leave-one-out procedure. Let $N$ be the total number of observations in $D$, and let $D_{-j}$ be the data set with the $j^{th}$ observation removed. Then $L(Y_j,f(\mathbf{X}^{(j)};\beta)) \mid D_{-j}$ should be an “almost” unbiased estimator of $e(f)$, right? Theoretically, to get $e(f)$ one should generate infinitely many new pairs $(Y_i,\mathbf{X}^{(i)})$ from the distribution that generated $D$, train the model on $D$, make predictions on that infinite new data set, and take the average loss. Here we instead fit the model on $D_{-j}$, which differs from $D$ only slightly, so $L(Y_j,f(\mathbf{X}^{(j)};\beta)) \mid D_{-j}$ should be an “almost” unbiased estimator of $e(f)$. If we go through all $N$ data points in $D$, compute $L(Y_j,f(\mathbf{X}^{(j)};\beta)) \mid D_{-j}$ for $j=1,2,\dots,N$ (assume $N$ is large), and take the average, the result should be very close to $e(f)$, right? We then do the same for model $g$. In that case we get very good estimates of $e(f)$ and $e(g)$, so we can do model selection based on them. Specifically, if $e(f)<e(g)$, then I should expect that on a LARGE new independent data set $T$, $f$ should perform better than $g$, correct? Also, the quantities $e(f)$ and $e(g)$ computed from $T$ should be very close to the ones computed from $D$, assuming both $D$ and $T$ are large. Is that correct?
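For concreteness, here is a minimal sketch of the leave-one-out estimate described above, assuming scikit-learn-style regressors and squared-error loss; `Ridge` and `RandomForestRegressor` are just placeholders for $f$ and $g$, and the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # placeholder predictors
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)    # placeholder response

def loo_loss(model, X, y):
    """Average squared-error loss over all leave-one-out splits."""
    losses = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])        # fit on D_{-j}
        pred = model.predict(X[test_idx])             # predict the held-out point
        losses.append((y[test_idx] - pred) ** 2)      # L(Y_j, f(X^{(j)}))
    return float(np.mean(losses))

e_f = loo_loss(Ridge(alpha=1.0), X, y)
e_g = loo_loss(RandomForestRegressor(n_estimators=20, random_state=0), X, y)
print(f"LOO estimate for f: {e_f:.4f}, for g: {e_g:.4f}")
```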
If all of the above is correct, then it seems that I have done model selection and model assessment in one step. But should I instead partition the data set into three pieces, i.e., first train the two models on one piece, then apply them to a second piece to do model selection, and finally apply them to the third piece to do model assessment?
What you just described is dividing the data into three parts: training, validation, and testing. This is a very common practice in machine learning; we use the validation set to help select hyperparameters. Choosing between this option, leave-one-out, and $k$-fold cross-validation depends mainly on how many samples you have: if you have a lot of samples, splitting the data into three parts can be more efficient. I am not sure about your claim of unbiasedness, though; you need to be careful when making that judgment.
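A minimal sketch of such a three-way split follows; the estimators, split proportions, and squared-error loss are illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                              # placeholder predictors
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)    # placeholder response

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {"f": Ridge(alpha=1.0),
              "g": RandomForestRegressor(n_estimators=50, random_state=0)}

# Model selection: fit on the training piece, compare on the validation piece.
val_errors = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_errors[name] = mean_squared_error(y_val, model.predict(X_val))
best = min(val_errors, key=val_errors.get)

# Model assessment: report the chosen model's error on the untouched test piece.
test_error = mean_squared_error(y_test, candidates[best].predict(X_test))
print(f"selected {best}; validation errors: {val_errors}; test error: {test_error:.4f}")
```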
Answered by Bashar Haddad on July 30, 2021
There are a couple of implicit assumptions that are necessary to make the argument reasonable:
But even then, your claim is not always correct. Here is a counterexample. It has no practical value, but perhaps it is a good illustration.
Suppose $\mathbf{X}$ is $(N \times 1)$-dimensional and write $\mathbf{X}_i = X_i$. To illustrate the point, suppose further that $Y_i = X_i$.
Now consider the following model $f$, given a test value $X_j$ and a training set $\tilde{\mathbf{X}}$: if $X_j$ is in the training set $\tilde{\mathbf{X}}$, predict $Y_j = X_j$; otherwise, predict $Y_j = 42$.
If I evaluate $L(Y, f(\mathbf{X}))$ on the training data itself, I get a perfect fit and therefore zero loss, because every $X_i$ (with $Y_i = X_i$) is in the training set $\mathbf{X}$.
If I evaluate $L(Y_j, f(\mathbf{X}_{-j}))$, I generally get a nonzero loss, because $X_j$ is not in the training set $\mathbf{X}_{-j}$, so the model predicts $42$.
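A small numerical illustration of this counterexample, assuming squared-error loss (the memorizing "model" is of course contrived):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=20)
Y = X.copy()                                  # Y_i = X_i, as in the example

def fit_memorizer(X_train, Y_train):
    """Return a model that memorizes its training pairs and predicts 42 otherwise."""
    table = dict(zip(X_train, Y_train))
    return lambda x: table.get(x, 42.0)

# In-sample: every X_i is in the training set, so the loss is exactly zero.
f_full = fit_memorizer(X, Y)
in_sample_loss = np.mean([(Y[i] - f_full(X[i])) ** 2 for i in range(len(X))])

# Leave-one-out: X_j is never in the training set D_{-j}, so the model predicts 42.
loo_losses = []
for j in range(len(X)):
    f_minus_j = fit_memorizer(np.delete(X, j), np.delete(Y, j))
    loo_losses.append((Y[j] - f_minus_j(X[j])) ** 2)

print(in_sample_loss, np.mean(loo_losses))    # 0.0 versus a large number
```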
Answered by Elias Strehle on July 30, 2021