I implemented a self-critical policy gradient (as described here) for text summarization.
However, after training, the results are not as high as expected (actually lower than without RL…).
I’m looking for general guidelines on how to debug RL-based algorithms.
I tried:
1e-4 in the paper)
The only resource I could find so far:
For my specific case, I made a few errors:
Being able to overfit a small dataset did not mean anything: when training on the whole dataset, the average reward was not going up.
The signal to look for is the average reward going up over the course of training.
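To make "the reward is going up" something you can actually check, here is a minimal sketch of a moving-average tracker. It is not part of my original code: `RewardTracker`, `window` and `min_delta` are hypothetical names, and the idea of comparing the current moving average against the one from a full window earlier is just one reasonable choice among many.

```python
from collections import deque


class RewardTracker:
    """Hypothetical helper: moving average of per-batch mean rewards."""

    def __init__(self, window=100):
        self.window = window
        self.recent = deque(maxlen=window)  # last `window` batch means
        self.history = []                   # moving average after each update

    def update(self, batch_rewards):
        """Record one batch of sampled-sequence rewards; return the moving average."""
        batch_mean = sum(batch_rewards) / len(batch_rewards)
        self.recent.append(batch_mean)
        moving_avg = sum(self.recent) / len(self.recent)
        self.history.append(moving_avg)
        return moving_avg

    def is_improving(self, min_delta=0.0):
        """Compare the latest moving average with the one a full window earlier.

        Returns None while there is not enough history to compare.
        """
        if len(self.history) <= self.window:
            return None
        return self.history[-1] - self.history[-1 - self.window] > min_delta
```

In a training loop you would call `tracker.update(rewards)` on every batch, using whatever reward the policy is trained on, and log the returned value. A flat or falling curve on the full dataset is the symptom I described above, even when a small overfitting run looks fine.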
I'm not accepting this answer as I believe it is not complete: it lacks general and systematic guidelines to debug a Reinforcement Learning algorithm.
Answered by Astariul on November 28, 2020