
Prioritized Replay, what does Importance Sampling really do?

Data Science Asked on March 5, 2021

I can’t understand the purpose of importance-sampling weights (IS) in Prioritized Replay (page 5).

A transition is more likely to be sampled from experience replay the larger its "cost" is. My understanding is that ‘IS’ helps with smoothly abandoning the use of prioritized replay after we’ve trained for long enough. But what do we use instead, uniform sampling?

I guess I just can’t see how each component of this coefficient affects the outcome. Could someone explain it in words?

$$w_i = \left( \frac{1}{N}\cdot \frac{1}{P(i)} \right)^\beta$$

It’s then used to dampen the gradient that we compute from the sampled transitions.

Where:

  • $w_i$ is "IS"
  • $N$ is the size of the Experience Replay buffer
  • $P(i)$ is the probability of selecting transition $i$, depending on "how large its cost is"
  • $\beta$ starts at 0.4 and is annealed closer and closer to 1 with each new epoch.

Is my understanding of these parameters also correct?
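
For concreteness, here is the formula as I read it, in a minimal NumPy sketch (the buffer size, priorities and $\beta$ below are made-up placeholders, not values from the paper):

    import numpy as np

    # Minimal sketch of the formula above; all numbers are made up for illustration.
    N = 4                                        # size of the experience replay buffer
    priorities = np.array([1.0, 2.0, 3.0, 4.0])  # e.g. |TD-error| ** alpha for each stored transition
    P = priorities / priorities.sum()            # P(i): probability of sampling transition i
    beta = 0.4                                   # annealed toward 1 as training progresses

    w = (1.0 / (N * P)) ** beta                  # w_i = (1/N * 1/P(i))^beta
    w = w / w.max()                              # normalized by max_i w_i for stability (as in the paper)
    print(w)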

Edit: Sometime after the answer was accepted I found an additional source, a video which might be helpful for beginners – MC Simmulations: 3.5 Importance Sampling


Edit: As @avejidah said in the comment to his answer, "$1/N$ is used to average the samples by the probability that they will be sampled".

To see why it’s important, assume $\beta$ is fixed to 1 and we have 4 samples, each with $P(i)$ as follows:

0.1  0.2   0.3     0.4

That is, the first entry has a 10% chance of being chosen, the second 20%, etc.
Now, inverting them, we get:

 10   5    3.333   2.5

Averaging via $1/N$ (which in our case is $1/4$) we get:

2.5  1.25  0.833   0.625     ...which add up to roughly 5.21

As we can see, they are much closer to zero than the simply inverted versions ($10, 5, 3.333, 2.5$). This means the gradient for our network won’t be magnified as much, resulting in much less variance as we train our network.

So, without this $\frac{1}{N}$, if we were lucky enough to select the least likely sample (probability $0.1$), the gradient would be scaled 10 times. It would be even worse with smaller probabilities, say a $0.00001$ chance, which is quite usual if our experience replay buffer has many thousands of entries.

In other words, $\frac{1}{N}$ is just there so that your hyperparameters (such as the learning rate) don’t require adjustment when you change the size of your experience replay buffer.
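
The little calculation above is easy to reproduce:

    import numpy as np

    # Reproducing the toy example above (beta fixed to 1, four samples).
    P = np.array([0.1, 0.2, 0.3, 0.4])
    inverted = 1.0 / P                 # 10, 5, 3.333, 2.5
    averaged = inverted / len(P)       # 2.5, 1.25, 0.833, 0.625
    print(averaged, averaged.sum())    # the sum is roughly 5.21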

2 Answers

DQN suffers intrinsically from instability. In the original implementation, multiple techniques are employed to improve stability:

  1. a target network is used with parameters that lag behind the trained model;
  2. rewards are clipped to the range [-1, 1];
  3. gradients are clipped to the range [-1, 1] (using something like Huber Loss or gradient clipping);
  4. and most relevant to your question, a large replay buffer is used to store transitions.

Continuing on point 4, using fully random samples from a large replay buffer helps to decorrelate the samples, because it's equally likely to sample transitions from hundreds of thousands of episodes in the past as it is to sample new ones. But when priority sampling is added into the mix, purely random sampling is abandoned: there's obviously a bias toward high-priority samples. To correct for this bias, the updates from high-priority samples are scaled down considerably by the IS weights, whereas the updates from low-priority samples are left relatively unchanged.

Intuitively this should make sense. Samples that have high priority are likely to be used in training many times. Reducing the weights on these oft-seen samples basically tells the network, "Train on these samples, but without much emphasis; they'll be seen again soon." Conversely, when a low-priority sample is seen, the IS weights basically tell the network, "This sample will likely never be seen again, so fully update." Keep in mind that these low-priority samples have a low TD-error anyway, and so there's probably not much to be learned from them; however, they're still valuable for stability purposes.
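
To make that concrete, here is a rough sketch of how the IS weights typically enter the loss; this is a generic weighted TD loss with made-up numbers, not the exact implementation from the paper:

    import numpy as np

    # Hypothetical per-sample quantities for one minibatch (illustrative only).
    td_errors = np.array([2.0, 0.5, -1.2, 0.1])   # delta_i from the Bellman backup
    is_weights = np.array([0.3, 0.7, 0.5, 1.0])   # w_i / max_j w_j

    # High-priority samples come with small IS weights, so their contribution is damped;
    # low-priority samples (weights near 1) update at nearly full strength.
    loss = np.mean(is_weights * td_errors ** 2)
    print(loss)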

In practice, the beta parameter is annealed up to 1 over the duration of training. The alpha parameter can be annealed simultaneously, thereby making prioritized sampling more aggressive while at the same time more strongly correcting the weights. From the paper you linked, keeping a fixed alpha (0.6) while annealing beta from 0.4 to 1 seems to be the sweet spot for priority-based sampling (page 14).
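
One common way to implement that schedule is a simple linear anneal of beta toward 1 while alpha stays fixed; the frame budget below is just a placeholder:

    ALPHA = 0.6  # priority exponent, kept fixed as in the setting reported in the paper

    def beta_by_frame(frame_idx, beta_start=0.4, beta_frames=1_000_000):
        # Linearly anneal beta from beta_start to 1 over beta_frames steps,
        # then hold it at 1 for the remainder of training.
        return min(1.0, beta_start + frame_idx * (1.0 - beta_start) / beta_frames)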

As a side note, from my own personal experience, simply ignoring the IS weights (i.e. not correcting at all) results in a network that trains well at first, but then the network appears to overfit, forgets what it's learned (a.k.a. catastrophic forgetting), and tanks. On Atari Breakout, for example, the averages increase during the first 50 million or so frames, then the averages completely tank. The paper you linked discusses this a bit, and provides some charts.

Correct answer by benbotto on March 5, 2021

I have a doubt. As per the PER paper:

For stability reasons, we always normalize weights by $1/\max_i w_i$ so that they only scale the update downwards

So doesn't the $1/N$ factor become ineffective? For example, consider the last sample:

case 1, without $N$: $2.5/10 = 0.25$
case 2, with $N=4$: $0.625/2.5 = 0.25$

so,

$$w_i = N^{-\beta} \cdot P(i)^{-\beta}$$
$$w_{\max} = N^{-\beta} \cdot P_{\min}^{-\beta}$$

by normalizing,

$w_i / w_{\max}$ will cancel out the $N^{-\beta}$ factor.
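
A quick numeric check with the question's four probabilities (using $\beta = 1$, hypothetical values only) illustrates the cancellation: the max-normalized weights come out identical with and without the $1/N$ factor.

    import numpy as np

    # Checking the cancellation numerically with the question's example (beta = 1).
    P = np.array([0.1, 0.2, 0.3, 0.4])
    N = len(P)
    beta = 1.0

    w_without_N = (1.0 / P) ** beta
    w_with_N = (1.0 / (N * P)) ** beta

    print(w_without_N / w_without_N.max())   # approximately [1., 0.5, 0.333, 0.25]
    print(w_with_N / w_with_N.max())         # identical: [1., 0.5, 0.333, 0.25]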

Please help me if my understanding is wrong.

Answered by Karthikeyan Nagarajan on March 5, 2021
