Asked by evangelos on December 1, 2021
According to Appendix A.4.1 of Boyd & Vandenberghe's Convex Optimization, the gradient of $f(X) := \log\det X$ is
$$\nabla f(X) = X^{-1}$$
The domain of $f$ here is the set of symmetric matrices $\mathbf{S}^n$. However, according to the book "Matrix Algebra from a Statistician's Perspective" by D. Harville, the gradient of $\log\det X$ for a symmetric $X$ must be (see eq. 8.12 of the book)
$$\nabla \log\det X = 2 X^{-1} - \operatorname{diag}(y_{11}, y_{22}, \dots, y_{nn})$$
where $y_{ii}$ is the $i$th diagonal element of $X^{-1}$. Now I'm not a mathematician, but to me Harville's formula seems correct, because he makes use of the fact that the entries of $X$ are not "independent". Indeed, in the case where the entries are "independent", Harville provides another formula (eq. 8.8 of his book), which matches that of Boyd & Vandenberghe.
Is this an error in the book of Boyd & Vandenberghe, or am I missing something here? To me it does seem like an error, but at the same time I find this extremely unlikely, as the book is very popular and if it were an error it would already be in the errata; it's much more likely that I'm misunderstanding something. This formula has already been mentioned in many questions on this website, but no question or answer that I saw mentions the possibility of the Boyd & Vandenberghe formula being wrong.
Edit based on the response of Profs. Boyd & Vandenberghe
Prof. Boyd kindly responded to my email about this issue and provided an explanation that he and Prof. Vandenberghe think can explain the discrepancy between the two formulas. In essence, their reply suggests that the discrepancy may be due to the choice of inner product. To better explain why, I need to summarize their proof in Appendix A.4.1 of the Convex Optimization book.
The proof is based on the idea that the derivative of a function gives the first-order approximation of the function. That is, the gradient of $f(X)$ can be obtained by finding a matrix $D$ that satisfies
$$f(X+\Delta X) \approx f(X) + \langle D, \Delta X \rangle.$$
In the book, Boyd & Vandenberghe use the trace inner product $\langle A, B \rangle = \operatorname{trace}(AB)$ and show that
$$f(X+\Delta X) \approx f(X) + \operatorname{trace}(X^{-1}\Delta X).$$
The book is publicly available; how they arrive at this expression can be seen in Appendix A.4.1. In their reply, Prof. Boyd suggests that they suspect the discrepancy stems from the choice of inner product. While they used the trace inner product, he suggests that some other authors may use $\langle A, B \rangle = \sum_{i \le j} A_{ij} B_{ij}$. The authors claim that this can explain the discrepancy (although I'm not sure whether they looked at the proof of Harville or others regarding the implicit or explicit use of this inner product), because the trace inner product puts twice as much weight on the off-diagonal entries.
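To make the inner-product point concrete, here is a small numerical sketch of my own (using NumPy; the random matrices, seed, and step size are arbitrary choices, not part of the books or the reply). It shows that the same directional derivative of $\log\det$ is represented by $X^{-1}$ under the trace inner product and by $2X^{-1} - \operatorname{diag}(X^{-1})$ under the $\sum_{i \le j}$ inner product:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive definite X and a symmetric direction dX.
n = 4
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, n))
dX = (B + B.T) / 2

Xinv = np.linalg.inv(X)

# Directional derivative of log det at X along dX, by central differences.
t = 1e-6
dd_numeric = (np.linalg.slogdet(X + t * dX)[1]
              - np.linalg.slogdet(X - t * dX)[1]) / (2 * t)

# Representation under the trace inner product: gradient X^{-1}.
dd_trace = np.trace(Xinv @ dX)

# Representation under <A,B> = sum_{i<=j} A_ij B_ij:
# gradient 2*X^{-1} - diag(X^{-1})  (Harville's matrix).
G = 2 * Xinv - np.diag(np.diag(Xinv))
iu = np.triu_indices(n)
dd_upper = np.sum(G[iu] * dX[iu])

print(dd_numeric, dd_trace, dd_upper)   # all three values agree
```

Both pairings reproduce the same directional derivative; only the matrix that "represents" it changes with the inner product.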
To summarize the example used in the accepted answer:
$$\phi(A) = \log\det(A) = \log(ab - x^2), \qquad A = \begin{bmatrix} a & x \\ x & b \end{bmatrix} \quad\implies\quad \frac{\partial\phi}{\partial A} = \frac{1}{ab - x^2}\begin{bmatrix} b & -2x \\ -2x & a \end{bmatrix}$$
Let's use this in a first-order Taylor expansion (use a colon to denote the matrix inner product):
$$\begin{aligned} \phi(A + dA) &= \phi(A) + \frac{\partial\phi}{\partial A}:dA \\ d\phi &= \frac{\partial\phi}{\partial A}:dA \\ &= \frac{1}{ab - x^2}\begin{bmatrix} b & -2x \\ -2x & a \end{bmatrix} : \begin{bmatrix} da & dx \\ dx & db \end{bmatrix} \\ &= \frac{a\,db - 4x\,dx + b\,da}{ab - x^2} \end{aligned}$$
which disagrees with the direct (non-matrix) calculation
$$d\log(ab - x^2) = \frac{a\,db - 2x\,dx + b\,da}{ab - x^2}.$$
On the other hand, using Boyd's result for the matrix calculation yields
$$d\phi = \frac{\partial\phi}{\partial A}:dA = \frac{1}{ab - x^2}\begin{bmatrix} b & -x \\ -x & a \end{bmatrix} : \begin{bmatrix} da & dx \\ dx & db \end{bmatrix} = \frac{a\,db - 2x\,dx + b\,da}{ab - x^2}$$
which is correct.
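As a sanity check, here is a short numerical sketch of my own (using NumPy; the values of $a$, $b$, $x$ and the perturbations are arbitrary) that pairs both candidate gradients with a symmetric $dA$ via the trace and compares them with the direct differential:

```python
import numpy as np

# Numbers for the 2x2 example: phi(A) = log(a*b - x**2).
a, b, x = 3.0, 2.0, 0.5
da, db, dx = 0.1, -0.2, 0.05          # arbitrary small perturbations

A  = np.array([[a, x], [x, b]])
dA = np.array([[da, dx], [dx, db]])
det = a * b - x**2

# Direct (scalar) differential of log(a*b - x**2).
dphi_direct = (a * db - 2 * x * dx + b * da) / det

# Trace pairing with Boyd's gradient A^{-1}: reproduces the direct result.
dphi_boyd = np.trace(np.linalg.inv(A) @ dA)

# Trace pairing with Harville's matrix: over-counts the off-diagonal term.
G_harville = np.array([[b, -2 * x], [-2 * x, a]]) / det
dphi_harville = np.trace(G_harville @ dA)

print(dphi_direct, dphi_boyd, dphi_harville)
# dphi_direct == dphi_boyd, while dphi_harville differs by -2*x*dx/det
```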
Carefully read the Srinivasan-Panda paper (which has been mentioned in other answers) for an explanation of why Harville (and many other references) are mistaken.
Harville's quantity may be useful in certain contexts, but it is not a gradient.
Answered by greg on December 1, 2021
This is a really well done paper that describes what is going on:
Shriram Srinivasan, Nishant Panda. (2020) "What is the gradient of a scalar function of a symmetric matrix?" https://arxiv.org/pdf/1911.06491.pdf
Their conclusion is that Boyd's formula is the correct one; it is obtained by restricting the Fréchet derivative (defined on $\mathbb{R}^{n \times n}$) to the subspace of symmetric $n \times n$ matrices, denoted $\mathbb{S}^{n \times n}$. Deriving the gradient in the reduced space of $n(n+1)/2$ dimensions and then mapping back to $\mathbb{S}^{n \times n}$ is subtle and can't be done so simply, which leads to the inconsistent result of Harville.
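To illustrate the "reduced space and map back" subtlety, here is a rough sketch of my own (in NumPy, not taken from the paper): naively scattering the partial derivatives over the $n(n+1)/2$ independent entries back into a symmetric matrix reproduces Harville's matrix, while accounting for the fact that the off-diagonal coordinates carry weight 2 in the Frobenius inner product recovers $X^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)           # symmetric positive definite
Xinv = np.linalg.inv(X)

def logdet_from_vech(v):
    # Rebuild the symmetric matrix from its n(n+1)/2 independent entries.
    M = np.zeros((n, n))
    M[np.triu_indices(n)] = v
    M = M + M.T - np.diag(np.diag(M))
    return np.linalg.slogdet(M)[1]

# Gradient in the reduced (vech) coordinates by central differences.
v0 = X[np.triu_indices(n)]
h = 1e-6
g = np.array([(logdet_from_vech(v0 + h * e) - logdet_from_vech(v0 - h * e)) / (2 * h)
              for e in np.eye(v0.size)])

# Naive scatter-back reproduces Harville's matrix ...
G_naive = np.zeros((n, n))
G_naive[np.triu_indices(n)] = g
G_naive = G_naive + G_naive.T - np.diag(np.diag(G_naive))
print(np.allclose(G_naive, 2 * Xinv - np.diag(np.diag(Xinv)), atol=1e-4))

# ... while accounting for the metric (off-diagonal coordinates enter the
# Frobenius inner product with weight 2) recovers Boyd's X^{-1}.
rows, cols = np.triu_indices(n)
weights = np.where(rows == cols, 1.0, 2.0)
G_metric = np.zeros((n, n))
G_metric[np.triu_indices(n)] = g / weights
G_metric = G_metric + G_metric.T - np.diag(np.diag(G_metric))
print(np.allclose(G_metric, Xinv, atol=1e-4))
```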
Answered by ahwillia on December 1, 2021
Let me call $X_0$ the symmetric matrix with entries $(X_0)_{i,j} = x_{i,j}$. By assumption we have $x_{i,j} = x_{j,i}$. Since $X_0$ is symmetric it can be diagonalized (if it's real), and its determinant is the product of the eigenvalues $\lambda_k$. So for a symmetric matrix $X$,
$$ \ln\det X = \sum_k \ln(\lambda_k). $$
Assume $X$ depends on a parameter $t$. Its derivative is
$$ \frac{d}{dt} \ln\det X(t) = \sum_k \frac{\dot{\lambda}_k}{\lambda_k}. $$
Say we want the derivative of $\ln\det X_0$ with respect to $x_{i,j}$ for $i \neq j$. Then, defining
$$\begin{aligned} V &= |i\rangle \langle j| + |j\rangle \langle i| \\ X(t) &= X_0 + tV, \end{aligned}$$
($V$ is the matrix with all zeros except ones at positions $(i,j)$ and $(j,i)$), we have
$$ \frac{\partial}{\partial x_{i,j}} \ln\det X_0 = \left. \frac{d}{dt} \ln\det X(t) \right|_{t=0} = \sum_k \frac{\dot{\lambda}_k}{\lambda_k}. $$
Now
$$ \dot{\lambda}_k = \langle v_k | V | v_k \rangle $$
where $|v_k\rangle$ is the eigenvector of $X_0$ corresponding to $\lambda_k$. Hence (for $i \neq j$)
$$\begin{aligned} \frac{\partial}{\partial x_{i,j}} \ln\det X_0 &= \sum_k \frac{\langle j | v_k\rangle \langle v_k | i\rangle}{\lambda_k} + (i \leftrightarrow j) \\ &= \left(X^{-1}\right)_{j,i} + \left(X^{-1}\right)_{i,j} \\ &= 2\left(X^{-1}\right)_{i,j} \end{aligned}$$
Let us now compute the derivative with respect to $x_{i,i}$. We reason exactly as before with $V = |i\rangle \langle i|$ and we get
$$\begin{aligned} \frac{\partial}{\partial x_{i,i}} \ln\det X_0 &= \sum_k \frac{\langle i | v_k\rangle \langle v_k | i\rangle}{\lambda_k} \\ &= \left(X^{-1}\right)_{i,i}. \end{aligned}$$
Hence the second formula is the correct one for a symmetric matrix. The first formula is correct for a non-symmetric matrix. All formulas of course require the matrix to be non-singular.
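A quick way to check this derivation numerically (my own sketch in NumPy, not part of the original answer): perturb each independent entry $x_{i,j}$ of a random symmetric positive definite matrix, moving both $(i,j)$ and $(j,i)$ together, and compare the central differences against $2(X^{-1})_{i,j}$ off the diagonal and $(X^{-1})_{i,i}$ on it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
X0 = A @ A.T + n * np.eye(n)          # symmetric positive definite
Xinv = np.linalg.inv(X0)

def logdet(M):
    return np.linalg.slogdet(M)[1]

h = 1e-6
partials = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        # Perturb x_ij; for i != j this moves both (i, j) and (j, i).
        V = np.zeros((n, n))
        V[i, j] = 1.0
        V[j, i] = 1.0                 # V[i, i] stays 1 when i == j
        partials[i, j] = (logdet(X0 + h * V) - logdet(X0 - h * V)) / (2 * h)
        partials[j, i] = partials[i, j]

# Matches 2*X^{-1} off the diagonal and X^{-1} on the diagonal.
print(partials)
print(2 * Xinv - np.diag(np.diag(Xinv)))
```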
Added
Let's explain the subtlety with one example that should clarify the matter. Consider the following symmetric matrix:
$$ A = \begin{pmatrix} a & x \\ x & b \end{pmatrix} $$
Now,
$$\log\det(A) = \log(ab - x^2)$$
and so
$$\begin{aligned} \frac{\partial \log\det(A)}{\partial a} &= \frac{b}{ab - x^2} \\ \frac{\partial \log\det(A)}{\partial x} &= -\frac{2x}{ab - x^2} \\ \frac{\partial \log\det(A)}{\partial b} &= \frac{a}{ab - x^2} \end{aligned}$$
And compare this with
$$ A^{-1} = \frac{1}{ab - x^2} \begin{pmatrix} b & -x \\ -x & a \end{pmatrix} $$
This simple calculation agrees with the formula above (cf. the factor of 2). As I said in the comment, the point is to be clear about what the independent variables are, or what variation we are using. Here I considered a variation $V$ which is symmetric, as this seems to be the problem's assumption.
Obviously, if you consider
$$ A' = \begin{pmatrix} a & y \\ x & b \end{pmatrix} $$
you will obtain $\nabla \log\det A' \sim (A')^{-1}$.
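For completeness, a small symbolic check of the 2×2 partials (my own sketch using SymPy, not part of the original answer; the non-symmetric variant with entry $y$ follows the matrix $A'$ above):

```python
import sympy as sp

a, b, x = sp.symbols('a b x', positive=True)
A = sp.Matrix([[a, x], [x, b]])          # symmetric: the off-diagonal entries are tied
phi = sp.log(A.det())                    # log(a*b - x**2)

# Partial derivatives with respect to the independent entries a, x, b.
print(sp.simplify(sp.diff(phi, a)))      # b/(a*b - x**2)    = (A^{-1})_{11}
print(sp.simplify(sp.diff(phi, x)))      # -2*x/(a*b - x**2) = 2*(A^{-1})_{12}
print(sp.simplify(sp.diff(phi, b)))      # a/(a*b - x**2)    = (A^{-1})_{22}

# Untying the off-diagonal entries (the non-symmetric A') removes the factor of 2.
y = sp.symbols('y', positive=True)
Ap = sp.Matrix([[a, y], [x, b]])
phip = sp.log(Ap.det())                  # log(a*b - x*y)
print(sp.simplify(sp.diff(phip, y)))     # -x/(a*b - x*y)    = (A'^{-1})_{21}
```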
Answered by lcv on December 1, 2021