Cross Validated Asked by SomethingSomething on January 12, 2021
In Andrew Ng’s Machine Learning course, lecture 4.6 on "Normal Equation", he says that in order to minimize $J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$, where $h_{\theta}(x) = \theta_{0} + \theta_{1}x_1 + \theta_{2}x_2 + \dots + \theta_{n}x_n$, and solve for $\theta$, one should take the design matrix $X$ and compute the following expression:
$\theta = (X^{T}X)^{-1}X^{T}y$,
where the design matrix $X$ is the matrix whose rows are the feature vectors $[1, x^{(i)}_{1}, x^{(i)}_{2}, \dots, x^{(i)}_{n}]$. He shows the Octave (Matlab) code for computing it as pinv(x'*x)*x'*y.
However, a long time ago, when I used NumPy to solve the same problem, I just used np.linalg.pinv(x) @ y. It is even stated in NumPy’s pinv docs that pinv solves the least-squares problem $Ax=b$, such that $\overline{x} = A^{+}b$.
So why should I compute $\theta = (X^{T}X)^{-1}X^{T}y$ when I can just compute $\theta = X^{-1}y$? Is there any difference?
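For concreteness, here is a minimal NumPy sketch on synthetic data (shapes and variable names are just illustrative, not from the course) showing that the two computations agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                                               # m examples, n features
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix with intercept column
y = rng.normal(size=m)

theta_normal = np.linalg.pinv(X.T @ X) @ X.T @ y   # Ng's normal-equation form
theta_pinv = np.linalg.pinv(X) @ y                 # direct pseudoinverse form

print(np.allclose(theta_normal, theta_pinv))       # True: both solve the same least-squares problem
```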
Actually, it is easy to see that $\theta = (X^{T}X)^{-1}X^{T}y$ is right, because by definition,
$(X^{T}X)^{-1}(X^{T}X) = I$,
but thanks to the associative property of matrix multiplication, we can write the same equation as
$((X^{T}X)^{-1}X^{T})X = I$,
so multiplying $X$ from the left by $(X^{T}X)^{-1}X^{T}$ yields $I$, meaning that $(X^{T}X)^{-1}X^{T}$ is a left-inverse of $X$. The left-inverse is the matrix used for solving the least-squares problem: multiplying both sides of $X\theta=y$ by it from the left turns the equation into $I\theta=(X^TX)^{-1}X^Ty$, so the coefficients are $\theta = (X^TX)^{-1}X^Ty$.
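A quick numerical check of this left-inverse property (a sketch with a random tall, full-column-rank matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                  # tall matrix, full column rank with probability 1
left_inv = np.linalg.inv(X.T @ X) @ X.T       # (X^T X)^{-1} X^T

print(np.allclose(left_inv @ X, np.eye(4)))   # True: it is a left-inverse of X
print(np.allclose(X @ left_inv, np.eye(50)))  # False: it is not a right-inverse
```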
Similarly, the following equation is true by definition,
$(XX^{T})(XX^{T})^{-1} = I$,
which again, thanks to the associative property of matrix multiplication, can be written as
$X(X^{T}(XX^{T})^{-1}) = I$,
so $X^{T}(XX^{T})^{-1}$ is a right-inverse of $X$ (note that this requires $XX^{T}$ to be invertible, i.e. $X$ to have full row rank).
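The analogous check for the right-inverse, which needs a wide, full-row-rank matrix so that $XX^{T}$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))                  # wide matrix, full row rank with probability 1
right_inv = X.T @ np.linalg.inv(X @ X.T)      # X^T (X X^T)^{-1}

print(np.allclose(X @ right_inv, np.eye(4)))  # True: it is a right-inverse of X
```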
$X^{-1}$ only makes sense for square matrices. For least-squares problems, we often have a strictly skinny (more rows than columns), full-rank matrix $X$. When $X$ is square and full rank, you can indeed use $\theta = X^{-1}y$, since then $(X^TX)^{-1}X^T = X^{-1}$. But when $X$ is strictly skinny and full rank, no inverse exists; only the pseudoinverse $(X^TX)^{-1}X^T$ does, which leads to the formula above.
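A sketch of this distinction with random matrices (purely illustrative): for a square, full-rank $X$ all three expressions coincide, while for a strictly skinny $X$ only the pseudoinverse forms agree:

```python
import numpy as np

rng = np.random.default_rng(3)

# Square, full-rank case: inv, pinv and the normal-equation form all agree.
Xs = rng.normal(size=(4, 4))
print(np.allclose(np.linalg.inv(Xs), np.linalg.pinv(Xs)))               # True
print(np.allclose(np.linalg.inv(Xs), np.linalg.inv(Xs.T @ Xs) @ Xs.T))  # True

# Strictly skinny, full-rank case: X has no inverse, but pinv(X) equals (X^T X)^{-1} X^T.
Xt = rng.normal(size=(50, 4))
print(np.allclose(np.linalg.pinv(Xt), np.linalg.inv(Xt.T @ Xt) @ Xt.T)) # True
```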
Correct answer by user303375 on January 12, 2021