Data Science: Asked by Osama El-Ghonimy on May 10, 2021
The update equation of AdaGrad is as follows:
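In the usual per-coordinate notation (a reconstruction, since the original formula appears as an image that is not reproduced here):

$$\theta_{t+1,i} \;=\; \theta_{t,i} \;-\; \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^{2}} + \epsilon}\, g_{t,i},$$

where $g_{t,i}$ is the gradient with respect to parameter $i$ at step $t$, $\eta$ is the global learning rate, and $\epsilon$ is a small constant for numerical stability.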
I understand that parameters tied to sparse features receive few updates, and that this is a problem. I also understand that the idea of AdaGrad is to make the update speed (learning rate) of a parameter inversely proportional to that parameter's own update history ($\eta$ divided by a function of its previous gradients), independently of the other parameters. This gives the rarely updated parameters of sparse features a higher effective learning rate than those of dense features.
My question is about how this is realised in the above equation. Why do we sum the squares of the past gradients and then take the square root? I understand that we need to get rid of the sign, but then why not simply sum the absolute values?
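To make the mechanism concrete, here is a minimal NumPy sketch of the per-parameter update as I understand it (an illustration only, not code from any particular library; the names `adagrad_step` and `accum` are just placeholders):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate the squared gradient for each parameter, then scale the
    # step by 1 / sqrt(accumulator): parameters with a large gradient
    # history get a smaller effective learning rate.
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Two parameters: index 0 gets a gradient every step ("dense"),
# index 1 only every tenth step ("sparse").
theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)

# Effective per-parameter rates: larger for the sparse parameter,
# because its accumulator has grown more slowly.
print(0.1 / (np.sqrt(accum) + 1e-8))
```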
The reason is that the article introducing the method, Duchi, Hazan, and Singer (2011), proves bounds on the regret using inequalities involving the L2 norm. The squaring is also explained by the fact that the accumulated term is the diagonal of a positive semidefinite matrix (the sum of outer products of the gradients), whose square root appears in the full-matrix version of the algorithm; the usual update is its diagonal approximation.
However, I don't know whether anyone has ever considered using AdaGrad with the L1 norm instead of the L2 norm.
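For comparison, the variant the question asks about (accumulating absolute values instead of squares, with no square root) could be sketched like this; it is purely an illustration, not an established algorithm, and carries none of the regret guarantees proved for the L2-based accumulator:

```python
import numpy as np

def adagrad_abs_step(theta, grad, accum, lr=0.1, eps=1e-8):
    # Hypothetical variant: accumulate |g| instead of g^2, so no square
    # root is needed when scaling the step.
    accum += np.abs(grad)
    theta -= lr * grad / (accum + eps)
    return theta, accum
```

One practical difference: for a gradient of roughly constant magnitude, the squared accumulator under the square root gives steps that shrink like $1/\sqrt{t}$, whereas the absolute-value accumulator gives steps that shrink like $1/t$, so the effective learning rate decays much more aggressively.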
Answered by user10676 on May 10, 2021