Asked by KevinKim on July 30, 2021
Assume $D$ is the training data set with both the value of the predictors $\mathbf{X}$ and the value of the response variable $Y$. I have a loss function $L$ and two models $f(\mathbf{X};\beta)$ and $g(\mathbf{X};\lambda)$, where $\beta$ and $\lambda$ are model parameters. Our goal is to estimate
\begin{equation}
e(f)=\mathbb{E}[L(Y,f(\mathbf{X};\beta))\mid D] \quad \text{and} \quad e(g)=\mathbb{E}[L(Y,g(\mathbf{X};\lambda))\mid D].
\end{equation}
Note that $e(f)$ is the generalization error of the model $f$ (similarly $e(g)$ for $g$) trained on the specific training data set $D$, where the expectation is taken over a new observation drawn from the same distribution that generated $D$.
Now consider a leave-one-out procedure. Let $N$ be the total number of observations in $D$, and let $D_{-j}$ be the data set with the $j^{th}$ observation removed. Then $L(Y_j,f(\mathbf{X}^{(j)};\beta)) \mid D_{-j}$ should be an “almost” unbiased estimator of $e(f)$, right? Theoretically, to get $e(f)$ one should generate infinitely many new pairs $(Y_i,\mathbf{X}^{(i)})$ from the distribution that generated $D$, train the model on $D$, make predictions on that infinite new data set, and take the average loss. Here we instead fit the model on $D_{-j}$, which differs from $D$ only slightly, so $L(Y_j,f(\mathbf{X}^{(j)};\beta)) \mid D_{-j}$ should be an “almost” unbiased estimator of $e(f)$. If we go through all $N$ data points in $D$, compute $L(Y_j,f(\mathbf{X}^{(j)};\beta)) \mid D_{-j}$ for $j=1,2,\dots,N$ (assume $N$ is large), and take the average, the result should be very close to $e(f)$, right? We then do the same for model $g$. In that case we get very good estimates of $e(f)$ and $e(g)$, so we can do model selection based on them. Specifically, if $e(f)<e(g)$, then I should expect that on a LARGE new independent data set $T$, $f$ should perform better than $g$, correct? Also, the quantities $e(f)$ and $e(g)$ computed from $T$ should be very close to the ones computed from $D$, assuming both $D$ and $T$ are large. Is that correct?
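For concreteness, here is a minimal sketch of the leave-one-out estimate described above, assuming scikit-learn-style regressors and squared-error loss; `Ridge` and `RandomForestRegressor` are just placeholders for $f$ and $g$, and the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # placeholder predictors
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)    # placeholder response

def loo_loss(model, X, y):
    """Average squared-error loss over all leave-one-out splits."""
    losses = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])        # fit on D_{-j}
        pred = model.predict(X[test_idx])             # predict the held-out point
        losses.append((y[test_idx] - pred) ** 2)      # L(Y_j, f(X^{(j)}))
    return float(np.mean(losses))

e_f = loo_loss(Ridge(alpha=1.0), X, y)
e_g = loo_loss(RandomForestRegressor(n_estimators=20, random_state=0), X, y)
print(f"LOO estimate for f: {e_f:.4f}, for g: {e_g:.4f}")
```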
If all of the above is correct, then it seems that I have done model selection and model assessment in one step. But should I instead partition the data set into three pieces, i.e., first train the two models on one piece, then apply them to a second piece to do model selection, and finally apply them to the third piece to do model assessment?
What you just described is dividing the data into three parts: training, validation, and testing. This is a very common practice in machine learning; we use the validation set to help select hyperparameters. Choosing between this option, leave-one-out, and $k$-fold cross-validation depends mainly on how many samples you have: if you have a lot of samples, splitting the data into three parts can be more efficient. I am not sure about your claim of unbiasedness, though; you need to be careful when making that judgment.
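A minimal sketch of such a three-way split follows; the estimators, split proportions, and squared-error loss are illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                              # placeholder predictors
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)    # placeholder response

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {"f": Ridge(alpha=1.0),
              "g": RandomForestRegressor(n_estimators=50, random_state=0)}

# Model selection: fit on the training piece, compare on the validation piece.
val_errors = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_errors[name] = mean_squared_error(y_val, model.predict(X_val))
best = min(val_errors, key=val_errors.get)

# Model assessment: report the chosen model's error on the untouched test piece.
test_error = mean_squared_error(y_test, candidates[best].predict(X_test))
print(f"selected {best}; validation errors: {val_errors}; test error: {test_error:.4f}")
```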
Answered by Bashar Haddad on July 30, 2021
There are a couple of implicit assumptions that are necessary to make the argument reasonable:
But even then, your claim is not always correct. Here is a counterexample. It has no practical value, but perhaps it is a good illustration.
Suppose $\mathbf{X}$ is $(N \times 1)$-dimensional and write $\mathbf{X}_i = X_i$. To illustrate the point, suppose further that $Y_i = X_i$.
Now consider the following model $f$, given a test value $X_j$ and a training set $\tilde{\mathbf{X}}$: if $X_j$ is in the training set $\tilde{\mathbf{X}}$, predict $Y_j = X_j$; otherwise, predict $Y_j = 42$.
If I evaluate $L(Y, f(\mathbf{X}))$ on the training data itself, I get a perfect fit and therefore zero loss, because every $X_i$ (with $Y_i = X_i$) is in the training set $\mathbf{X}$.
If I evaluate $L(Y_j, f(\mathbf{X}_{-j}))$, I generally get a nonzero loss, because $X_j$ is not in the training set $\mathbf{X}_{-j}$, so the model predicts $42$.
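A small numerical illustration of this counterexample, assuming squared-error loss (the memorizing "model" is of course contrived):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=20)
Y = X.copy()                                  # Y_i = X_i, as in the example

def fit_memorizer(X_train, Y_train):
    """Return a model that memorizes its training pairs and predicts 42 otherwise."""
    table = dict(zip(X_train, Y_train))
    return lambda x: table.get(x, 42.0)

# In-sample: every X_i is in the training set, so the loss is exactly zero.
f_full = fit_memorizer(X, Y)
in_sample_loss = np.mean([(Y[i] - f_full(X[i])) ** 2 for i in range(len(X))])

# Leave-one-out: X_j is never in the training set D_{-j}, so the model predicts 42.
loo_losses = []
for j in range(len(X)):
    f_minus_j = fit_memorizer(np.delete(X, j), np.delete(Y, j))
    loo_losses.append((Y[j] - f_minus_j(X[j])) ** 2)

print(in_sample_loss, np.mean(loo_losses))    # 0.0 versus a large number
```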
Answered by Elias Strehle on July 30, 2021