
Why L2 norm in AdaGrad update equation not L1?

Data Science Asked by Osama El-Ghonimy on May 10, 2021

The update equation of AdaGrad is as follows:

$$theta_{t+1,i} = theta_{t,i} - frac{eta}{sqrt{sum_{tau=1}^{t} g_{tau,i}^2} + epsilon}, g_{t,i},$$

where $g_{t,i}$ is the gradient of the loss with respect to parameter $theta_i$ at step $t$.

I understand that sparse features receive only small or infrequent updates and that this is a problem. I also understand that the idea of AdaGrad is to make the update speed (effective learning rate) of a parameter inversely proportional to that parameter's own update history ($eta$ divided by a function of the previous updates), independently of the other parameters. This gives the rarely updated parameters of sparse features a higher effective learning rate than the dense ones.

My question is about how this is implemented in the above equation. Why are we summing the squares of the update history and then taking the square root? I understand that we need to get rid of the negative sign, so why not directly sum the absolute values instead?
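For concreteness, here is a minimal NumPy sketch of the per-parameter scaling I am describing (names and hyperparameters are illustrative, not taken from the paper):

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate the squared gradient per parameter, then scale the step
    # by the square root of that accumulator (AdaGrad's denominator).
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Toy usage: coordinate 0 gets a gradient every step ("dense"),
# coordinate 1 only every 10th step ("sparse"). The sparse coordinate
# ends up with a larger effective step size per non-zero gradient.
theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_update(theta, grad, accum)

# The variant I am asking about would use accum += np.abs(grad)
# and drop the square root in the denominator.
```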

One Answer

The reason is that the article introducing the method, which can be found here, proves bounds on the regret using inequalities involving the L2 norm. The squaring is also explained by the fact that the accumulator is the diagonal of a positive semi-definite matrix, namely the sum of outer products of the gradients.
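For reference, the diagonal form of the update can be sketched as follows (ignoring the projection step and the $delta$ regularizer used in the paper):

$$G_t = sum_{tau=1}^{t} g_tau g_tau^top, qquad theta_{t+1,i} = theta_{t,i} - frac{eta}{sqrt{(G_t)_{ii}}}, g_{t,i}, qquad (G_t)_{ii} = sum_{tau=1}^{t} g_{tau,i}^2.$$

The per-coordinate denominator $sqrt{sum_tau g_{tau,i}^2}$ is exactly the L2 norm of the gradient history of coordinate $i$, which is the quantity the regret analysis works with.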

However, I don't know whether anyone has ever considered using AdaGrad with the L1 norm instead of the L2 norm.

Answered by user10676 on May 10, 2021
