Data Science Asked by Osama El-Ghonimy on May 10, 2021
The update equation of AdaGrad is as follows:

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \frac{\eta}{\sqrt{G_{t,\,ii} + \epsilon}}\; g_{t,\,i}, \qquad G_{t,\,ii} = \sum_{\tau=1}^{t} g_{\tau,\,i}^{\,2}$$

where $g_{t,i}$ is the gradient of the objective with respect to parameter $\theta_i$ at step $t$.
I understand that sparse features receive few updates and that this is a problem. I understand that the idea of AdaGrad is to make the update speed (effective learning rate) of a parameter inversely proportional to that parameter's update history ($\eta$ divided by the accumulated past updates), independently of the other parameters. As a result, sparsely updated features end up with a higher learning rate than densely updated ones.
My question is about how this is implemented in the above equation. Why do we sum the squares of the past gradients and then take the square root? I understand that we need to get rid of the sign, so why not sum the absolute values directly?
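For concreteness, here is a minimal NumPy sketch of the per-parameter update described above; the variable names (`eta`, `eps`, `accum`) are my own, not from the question:

```python
import numpy as np

def adagrad_update(theta, grad, accum, eta=0.01, eps=1e-8):
    """One AdaGrad step: per-parameter step size eta / sqrt(sum of squared past gradients)."""
    accum += grad ** 2                              # accumulate squared gradients element-wise
    theta -= eta * grad / np.sqrt(accum + eps)      # coordinates with a large history take smaller steps
    return theta, accum

# Toy usage: coordinate 0 gets a gradient every step (dense), coordinate 1 only rarely (sparse).
theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_update(theta, grad, accum)
```

Because the sparse coordinate accumulates far less history, its effective learning rate stays larger than that of the dense coordinate.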
The reason is that the article introducing the method, which can be found here, proves bounds on the regret using inequalities involving the L2 norm. The square is also explained by the fact that the accumulated term is the diagonal of a positive quadratic form.
However, I don't know whether anyone has ever considered using AdaGrad with the L1 norm instead of the L2 norm.
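To make the contrast concrete, here is a small sketch (my own, not from the paper) of the two accumulators side by side: the standard square-root-of-summed-squares denominator, and the hypothetical variant that simply sums absolute values, as asked about above:

```python
import numpy as np

def adagrad_scale_l2(grads, eps=1e-8):
    """Standard AdaGrad denominator: sqrt of the sum of squared gradients (L2-type accumulation)."""
    return np.sqrt(np.sum(np.asarray(grads) ** 2, axis=0) + eps)

def adagrad_scale_l1(grads, eps=1e-8):
    """Hypothetical variant: sum of absolute gradients (L1-type accumulation)."""
    return np.sum(np.abs(np.asarray(grads)), axis=0) + eps

grads = [np.array([0.5, 0.0]), np.array([0.5, 0.0]), np.array([0.5, 1.0])]
print(adagrad_scale_l2(grads))  # approx [0.866, 1.0]
print(adagrad_scale_l1(grads))  # approx [1.5,   1.0]
```

Both variants kill the sign and grow with the update history, but the L2 form is the one for which the paper's regret analysis goes through.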
Answered by user10676 on May 10, 2021