Artificial Intelligence Asked by kosa on August 24, 2021
I am trying to test DQN on the FrozenLake environment in gym using TensorFlow 2.x. The update rule is (off-policy):
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$$
I am using an epsilon-greedy policy.
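For reference, epsilon-greedy action selection over the Q-network's outputs looks roughly like this (a minimal sketch, assuming a Keras model `model` and a NumPy state array; the names are illustrative, not taken from the Colab notebook):

```python
import numpy as np

def epsilon_greedy_action(model, state, epsilon, num_actions):
    # Explore: with probability epsilon, pick a uniformly random action.
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    # Exploit: act greedily with respect to the current Q-network.
    q_values = model.predict(state[np.newaxis], verbose=0)[0]
    return int(np.argmax(q_values))
```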
In this environment, we get a reward only if we succeed. So I explored 100% of the time until I had 50 successes. Then I saved the failure and success data in separate bins, sampled (with replacement) from those bins, and used the samples to train the Q-network. However, no matter how long I train, the agent doesn't seem to learn.
The code is available in Colab. I have been working on this for a couple of days.
PS: I modified the code for SARSA and Expected SARSA; nothing works.
I see at least 3 issues with your DQN code that need to be fixed:
You should not have separate replay memories for successes/failures. Put all of your experiences in one replay memory and sample from it uniformly.
Your replay memory is extremely small with only 2,000 samples. You need to make it significantly larger; try at least 100,000 up to 1,000,000 samples.
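To make the first two fixes concrete, here is a minimal sketch of a single, large replay buffer that is sampled uniformly (the class and variable names are illustrative, not taken from your notebook):

```python
import random
from collections import deque

class ReplayBuffer:
    """One buffer holding all transitions, successes and failures alike."""

    def __init__(self, capacity=100_000):
        # Old transitions are evicted automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling over every stored transition.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```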
Your batch_target is incorrect. You need to train on returns and not just rewards. In your train function, compute the 1-step return $r + \gamma \cdot \max_{a'} Q(s',a')$, remembering to set $\max_{a'} Q(s',a') = 0$ if $s'$ is terminal, and then pass it to model.fit() as your prediction target.
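A rough sketch of that target computation, assuming a Keras Q-network `model` and a batch of (state, action, reward, next_state, done) tuples drawn from the replay buffer (array names are illustrative):

```python
import numpy as np

def train_step(model, batch, gamma=0.99):
    # Unpack a batch of (s, a, r, s', done) transitions sampled from replay.
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

    # Bootstrap value max_a' Q(s', a'), forced to 0 where s' is terminal.
    next_q = model.predict(next_states, verbose=0)
    max_next_q = next_q.max(axis=1) * (1.0 - dones.astype(np.float32))

    # The 1-step return r + gamma * max_a' Q(s', a') is the regression target
    # for the action actually taken; other actions keep their current estimates.
    targets = model.predict(states, verbose=0)
    targets[np.arange(len(batch)), actions] = rewards + gamma * max_next_q

    model.fit(states, targets, epochs=1, verbose=0)
```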
Answered by Brett Daley on August 24, 2021