Data Science Asked by Matthew Yang on June 7, 2021
I have read in Stanford’s CS229 course notes that, to justify the least-squares update rule probabilistically, the following model is assumed:
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$
where $\epsilon^{(i)}$ represents random noise drawn i.i.d. from a Normal distribution.
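For reference, the way I read the notes, this assumption (with $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$) implies the conditional density
$$p\big(y^{(i)} \mid x^{(i)}; \theta\big) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\big(y^{(i)} - \theta^T x^{(i)}\big)^2}{2\sigma^2}\right),$$
so maximizing the log-likelihood $\ell(\theta) = \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)$ over the training set is equivalent to minimizing the least-squares cost $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(y^{(i)} - \theta^T x^{(i)}\big)^2$.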
I understand why $\epsilon^{(i)}$ makes sense when $h_\theta(x^{(i)}) = \theta^T x^{(i)}$ is a trained model, but since the eventual goal of this assumption is to derive the update rule, it should also make sense when $h_\theta$ has not been trained yet. However, the assumption does not make much sense to me when the model is arbitrary and completely untrained. Is my interpretation correct? Have I missed something? If not, how do we justify $\epsilon^{(i)}$ when the model is inaccurate (not trained)?
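To make my mental picture concrete, here is a minimal simulation sketch of how I picture the setup (NumPy; the particular values of theta_true, sigma, and theta_guess are made up purely for illustration): the data are generated with i.i.d. Gaussian noise around the true parameters, yet the residuals of an arbitrary, untrained parameter vector are clearly not that noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = theta_true^T x + epsilon,
# with epsilon ~ N(0, sigma^2) i.i.d. (theta_true and sigma are
# illustrative values, not from the notes).
n, d = 200, 3
theta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(scale=sigma, size=n)

# Residuals with respect to the true parameters: these are exactly the
# i.i.d. Gaussian noise terms epsilon^{(i)}.
print(np.std(y - X @ theta_true))   # close to sigma

# Residuals with respect to an arbitrary, untrained parameter vector:
# these are the noise plus a systematic term (theta_true - theta_guess)^T x,
# so they are not the epsilon^{(i)} of the assumption.
theta_guess = np.zeros(d)
print(np.std(y - X @ theta_guess))  # much larger than sigma
```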
Thanks in advance.