Asked by Sergiu Ionescu on January 28, 2021
I am trying to implement an Actor-Critic method that controls an RC car. For this I have implemented a simulated environment and Actor-Critic TensorFlow.js models.
My intention is to train a model to navigate an environment without colliding with various obstacles.
For this I have the following:
State (continuous):
Action (discrete):
Reward (cumulative):
The structure of the models:
buildActor() {
  const model = tf.sequential();

  model.add(tf.layers.inputLayer({inputShape: [this.stateSize]}));
  model.add(tf.layers.dense({
    units: parseInt(this.config.hiddenUnits),
    activation: 'relu',
    kernelInitializer: 'glorotUniform',
  }));
  model.add(tf.layers.dense({
    units: parseInt(this.config.hiddenUnits / 2),
    activation: 'relu',
    kernelInitializer: 'glorotUniform',
  }));
  model.add(tf.layers.dense({
    units: this.actionSize,
    activation: 'softmax',
    kernelInitializer: 'glorotUniform',
  }));

  this.compile(model, this.actorLearningRate);
  return model;
}
buildCritic() {
  const model = tf.sequential();

  model.add(tf.layers.inputLayer({inputShape: [this.stateSize]}));
  model.add(tf.layers.dense({
    units: parseInt(this.config.hiddenUnits),
    activation: 'relu',
    kernelInitializer: 'glorotUniform',
  }));
  model.add(tf.layers.dense({
    units: parseInt(this.config.hiddenUnits / 2),
    activation: 'relu',
    kernelInitializer: 'glorotUniform',
  }));
  model.add(tf.layers.dense({
    units: this.valueSize,
    activation: 'linear',
    kernelInitializer: 'glorotUniform',
  }));

  this.compile(model, this.criticLearningRate);
  return model;
}
The models are compiled with an Adam optimizer and Huber loss:
compile(model, learningRate) {
  model.compile({
    optimizer: tf.train.adam(learningRate),
    loss: tf.losses.huberLoss,
  });
}
Training:
trainModel(state, action, reward, nextState) {
  let advantages = new Array(this.actionSize).fill(0);

  let normalizedState = normalizer.normalizeFeatures(state);
  let tfState = tf.tensor2d(normalizedState, [1, state.length]);
  let normalizedNextState = normalizer.normalizeFeatures(nextState);
  let tfNextState = tf.tensor2d(normalizedNextState, [1, nextState.length]);

  // dataSync() returns a typed array; take the first element to get the scalar value estimate.
  let predictedCurrentStateValue = this.critic.predict(tfState).dataSync()[0];
  let predictedNextStateValue = this.critic.predict(tfNextState).dataSync()[0];

  let target = reward + this.discountFactor * predictedNextStateValue;
  let advantage = target - predictedCurrentStateValue;
  advantages[action] = advantage;

  // Both fit() calls are asynchronous; the losses are recorded when their promises resolve.
  this.actor.fit(tfState, tf.tensor([advantages]), {
    epochs: 1,
  }).then(info => {
    this.latestActorLoss = info.history.loss[0];
    this.actorLosses.push(this.latestActorLoss);
  });

  this.critic.fit(tfState, tf.tensor([target]), {
    epochs: 1,
  }).then(info => {
    this.latestCriticLoss = info.history.loss[0];
    this.criticLosses.push(this.latestCriticLoss);
  });

  this.advantages.push(advantage);
  pushToEvolutionChart(this.epoch, this.latestActorLoss, this.latestCriticLoss, advantage);
  this.epoch++;
}
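As a side note, the two fit() calls are not awaited, so latestActorLoss and latestCriticLoss may still hold the previous step's values when pushToEvolutionChart runs. A minimal sketch of an awaited variant, assuming the same fields and the same normalizer and pushToEvolutionChart helpers as above:

async trainModel(state, action, reward, nextState) {
  const advantages = new Array(this.actionSize).fill(0);
  const tfState = tf.tensor2d(normalizer.normalizeFeatures(state), [1, state.length]);
  const tfNextState = tf.tensor2d(normalizer.normalizeFeatures(nextState), [1, nextState.length]);

  const predictedCurrentStateValue = this.critic.predict(tfState).dataSync()[0];
  const predictedNextStateValue = this.critic.predict(tfNextState).dataSync()[0];

  const target = reward + this.discountFactor * predictedNextStateValue;
  const advantage = target - predictedCurrentStateValue;
  advantages[action] = advantage;

  // Wait for both updates so the logged losses belong to the current step.
  const [actorInfo, criticInfo] = await Promise.all([
    this.actor.fit(tfState, tf.tensor([advantages]), {epochs: 1}),
    this.critic.fit(tfState, tf.tensor([target]), {epochs: 1}),
  ]);
  this.latestActorLoss = actorInfo.history.loss[0];
  this.latestCriticLoss = criticInfo.history.loss[0];
  this.actorLosses.push(this.latestActorLoss);
  this.criticLosses.push(this.latestCriticLoss);
  this.advantages.push(advantage);

  pushToEvolutionChart(this.epoch, this.latestActorLoss, this.latestCriticLoss, advantage);
  this.epoch++;

  // Free the temporary input tensors.
  tfState.dispose();
  tfNextState.dispose();
}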
You can give the simulation a spin at https://sergiuionescu.github.io/esp32-auto-car/sym/sym.html.
I found that some behaviors are being picked up: the model learns to prioritize moving forward after a few episodes, but once it hits a wall it reprioritizes spinning and seems to completely ‘forget’ that moving forward was ever preferred.
I’ve been trying to follow https://keras.io/examples/rl/actor_critic_cartpole/ to a certain extent, but have not found an equivalent of the way back-propagation is handled there – GradientTape.
Is it possible to perform training similar to the Keras example in TensorFlow.js?
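From what I can tell, the closest TensorFlow.js equivalents of GradientTape are tf.variableGrads() and optimizer.minimize(): the ops executed inside the callback are recorded, and gradients of the returned scalar loss are taken with respect to the trainable variables they touch. A rough sketch of the actor update in that style; the function name and the actionIndex/advantage arguments are illustrative, not part of the code above:

function actorGradientStep(actor, optimizer, tfState, actionIndex, advantage, actionSize) {
  // optimizer is assumed to be a tf.train.adam(...) instance created outside model.compile().
  optimizer.minimize(() => tf.tidy(() => {
    const probs = actor.apply(tfState); // forward pass, shape [1, actionSize], recorded for gradients
    const mask = tf.oneHot(tf.tensor1d([actionIndex], 'int32'), actionSize);
    const logProb = tf.log(tf.sum(tf.mul(probs, mask), 1).add(1e-8)); // log pi(a|s) of the taken action
    // Policy-gradient loss: -advantage * log pi(a|s), returned as a scalar so minimize() can backprop it.
    return tf.neg(logProb.mul(tf.scalar(advantage))).asScalar();
  }));
}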
The theory I’ve gone through on Actor-Critic mentions that the critic should estimate the reward yet to be obtained over the rest of the episode, but I am training the critic with:
reward + this.discountFactor * predictedNextStateValue
where reward is the cumulative reward up to the current step.
Should I keep track of the maximum total reward seen in previous episodes and subtract my reward from that instead?
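For comparison, the one-step TD target in actor-critic is normally built from the immediate reward of the transition rather than the running episode total. A sketch of what that would mean here; the stepReward bookkeeping is an assumption about the environment, not something the code above already does:

// Assumption: the environment reports a running total, so the per-step reward
// can be recovered by differencing two consecutive totals.
let stepReward = cumulativeReward - previousCumulativeReward;
// One-step TD target and advantage, as in trainModel() above, but with the per-step reward.
let target = stepReward + this.discountFactor * predictedNextStateValue;
let advantage = target - predictedCurrentStateValue;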
When I am training the actor, I am generating a zero-filled advantages tensor:
let advantages = new Array(this.actionSize).fill(0);
let target = reward + this.discountFactor * predictedNextStateValue;
let advantage = target - predictedCurrentStateValue;
advantages[action] = advantage;
All actions other than the taken one receive a 0 advantage. Could this discourage previous actions that were proven beneficial?
Should I average out the advantages per state and action?
Thanks for having the patience to go through all of this.
After tinkering a bit more with my experiment, I got it to consistently manifest the intended behavior after around 200 episodes.
Changes to the model itself were minimal: I replaced the loss function on the actor with tf.losses.softmaxCrossEntropy.
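For reference, a minimal sketch of how that change could be wired through a compile helper like the one above; compileActor is a hypothetical name, and the critic keeps the Huber loss:

compileActor(model, learningRate) {
  model.compile({
    optimizer: tf.train.adam(learningRate),
    // Layers models call the loss as loss(yTrue, yPred), so the zero-filled advantages
    // vector lands in the onehotLabels slot: only the taken action contributes to the loss.
    loss: tf.losses.softmaxCrossEntropy,
  });
}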
Some changes to the training environment seemed to have a significant impact and improved training:
Answered by Sergiu Ionescu on January 28, 2021