
Normal equation for linear regression is illogical

Data Science Asked by Martian on May 21, 2021

Currently I’m taking Andrew Ng’s course. He gives the following formula to find the solution for linear regression analytically:

$\theta = (X^T X)^{-1} X^T y$

He doesn’t explain it, so I searched for it and found that $(X^T X)^{-1} X^T$ is actually the formula for the pseudoinverse in the case where the columns of $X$ are linearly independent. And this actually makes a lot of sense. Basically, we want to find a $\theta$ such that $X \theta = y$, thus $\theta = X^{-1} y$, so if we replace $X^{-1}$ with the pseudoinverse formula we get exactly $\theta = (X^T X)^{-1} X^T y$.
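To make that step explicit (a quick check, using only the linear-independence assumption above): multiplying $(X^T X)^{-1} X^T$ by $X$ gives the identity,

$((X^T X)^{-1} X^T) X = (X^T X)^{-1} (X^T X) = I$

so $(X^T X)^{-1} X^T$ is a left inverse of $X$, which is exactly the Moore-Penrose pseudoinverse $X^+$ when the columns of $X$ are linearly independent, and hence $\theta = X^+ y$.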

What I don’t understand is why nobody mentions that this verbose formula is just $\theta = X^{-1} y$ with the pseudoinverse in place of $X^{-1}$. Okay, Andrew Ng’s course is for beginners and he didn’t want to throw a bunch of math at students. But Octave, where the assignments are done, has the function pinv() to find a pseudoinverse. Even more, Andrew Ng actually mentions the pseudoinverse in his videos on the normal equation, in the context of $X^T X$ being singular so that we can’t find its inverse. As I mentioned above, $(X^T X)^{-1} X^T$ is the formula for the pseudoinverse only in the case where the columns are linearly independent. If they are dependent (e.g. some features are redundant), there are other formulas to consider, but Octave handles all these cases under the hood of the pinv() function, which is more than just a macro for $(X^T X)^{-1} X^T$. And instead of saying to use pinv(X) * y, Andrew Ng gives this: pinv(X' * X) * X' * y. Basically, we use a pseudoinverse to find a pseudoinverse. Why?
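For what it’s worth, the two expressions can be compared directly in Octave. A minimal sketch on made-up toy data (the matrix X and vector y below are purely illustrative):

    % Hypothetical toy data: 5 samples, 3 features (first column is the intercept).
    X = [1 1 2;
         1 2 4;
         1 3 5;
         1 4 8;
         1 5 9];
    y = [2; 4; 5; 8; 10];

    theta1 = pinv(X' * X) * X' * y;   % the formula from the course
    theta2 = pinv(X) * y;             % pseudoinverse applied to X directly
    theta3 = (X' * X) \ (X' * y);     % solving the normal equations directly

    disp(max(abs(theta1 - theta2)))   % agrees to roughly machine precision

Mathematically $X^+ = (X^T X)^+ X^T$ holds for any $X$, so the two expressions coincide; numerically, forming $X^T X$ squares the condition number, which is one reason pinv(X) * y is often preferred.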

One Answer

Hello Oleksii and welcome to DSSE.

The formula you are asking about is not for a pseudoinverse.

$\theta = (X^T X)^{-1} X^T y$

Where

  • $\theta$ is your regressor (the vector of coefficients)
  • $X$ is a matrix containing stacked vectors (as rows) of your features/independent variables
  • $y$ is a matrix containing stacked vectors (or scalars) of your predictions/dependent variables

This equation is the solution to a linear system of equations of the form $Ax = b$ (the normal equations) that arises when minimizing the least-squares loss.
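To spell that out (a standard derivation, not given in the original answer): setting the gradient of the squared loss to zero yields the normal equations,

$J(\theta) = \|X\theta - y\|^2, \quad \nabla_\theta J = 2 X^T (X\theta - y) = 0 \quad \Rightarrow \quad X^T X \theta = X^T y$

and when $X^T X$ is invertible this gives exactly $\theta = (X^T X)^{-1} X^T y$ above.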

The reason why you see this pinv() in the code is that if $X$ does not have enough linearly independent columns, $X^T X$ (also known as $R$, the autocorrelation matrix of the data; its inverse is called the precision matrix) will be a singular (or near-singular) matrix, whose inversion may not be possible, even when the singularity arises only from the finite working precision of your computer/programming language.
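This is easy to reproduce; a small hypothetical Octave example where one feature is an exact multiple of another:

    % Hypothetical example: a redundant feature makes X' * X singular.
    X = [1 2 4;
         1 3 6;
         1 4 8];          % third column = 2 * second column
    R = X' * X;
    disp(rank(R))         % prints 2, not 3: R is singular
    disp(rcond(R))        % ~0: inv(R) is not usable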

Using pinv() is usually not recommended, because even though it lets you compute a regressor, that regressor will tend to overfit the training data. An alternative solution for working with a singular matrix is adding $\delta I$ to $R$,

where $\delta$ is a small constant (usually 1 to 10 times machine epsilon, eps) and $I$ is the identity matrix.
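A sketch of that diagonal-loading fix (this is ridge regularization in disguise), reusing the hypothetical redundant-feature data from above; scaling delta by norm(R) is a common practical choice and not part of the original answer:

    % Diagonal loading: add delta * I to R so the system becomes solvable
    % even though the third column is 2 * the second column.
    X = [1 2 4;
         1 3 6;
         1 4 8];
    y = [3; 5; 7];
    R = X' * X;                    % exactly singular here
    % The answer suggests delta of 1 to 10 eps; scaling by norm(R) keeps
    % the loading meaningful relative to the size of R's entries.
    delta = 1e-8 * norm(R);
    theta = (R + delta * eye(size(X, 2))) \ (X' * y);
    disp(theta)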

Correct answer by Pedro Henrique Monforte on May 21, 2021
