
Policy-based RL methods - what do continuous actions look like?

Data Science Asked by Kari on December 20, 2020

I've read several times that Policy-based RL methods can work with a continuous action space (e.g. move left 5 meters, move right 5.5312 meters), rather than with discrete actions like Value-based methods (e.g. Q-learning).

If Policy-based methods produce the probability of taking an action from the current state $S_t$, how can such an action be continuous? We might get 4 probabilities for our actions, but we still have to choose a single action from the four:

$A_1: 10\%$

$A_2: 35\%$

$A_3: 5\%$

$A_4: 50\%$

Thus, it's not obvious how my action can be something continuous like "turn +19.2345 degrees clockwise". Such an action must have already been pre-defined to the value "19.2345", right?

One Answer

The main requirement of on-policy policy gradient methods is that they use a parametric policy $\pi(a|s, \theta)$ that is differentiable with respect to the parameters $\theta$.

This is not restricted to discrete probability distributions (e.g. the softmax output layer of a neural network). Any probability distribution that is differentiable and possible to sample from is all that is required. This is true of the Normal distribution, for instance, so one relatively common solution in continuous action spaces is for the neural network to output the mean and standard deviation of the distribution of each component of the action vector that takes continuous values.
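To make that concrete, here is a minimal sketch (not code from the answer) of such a policy network in PyTorch; the class name `GaussianPolicy`, the layer sizes, and the choice of a state-independent log standard deviation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to (mean, std) for each continuous action component."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)        # mu for each action component
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned log sigma, state-independent here

    def forward(self, state):
        h = self.body(state)
        mean = self.mean_head(h)
        std = self.log_std.exp()   # exponentiate so sigma stays positive
        return mean, std
```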

Typically the neural network does not perform the sampling to choose an action. This is also true for a softmax output - it is only additional code, outside of the NN, that interprets the values and selects the action. In addition, and unlike softmax, the NN does not need to directly represent the probability distribution function, just enough data to drive the sampling process. However, the nature of the distribution function does need to be taken into account when calculating the gradient in policy gradient methods.
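Continuing the same hypothetical sketch, the sampling and the use of the distribution in the gradient both live outside the network. A REINFORCE-style update might look roughly like this, where the state, the return `G`, and the hyperparameters are placeholders:

```python
import torch

policy = GaussianPolicy(state_dim=4, action_dim=1)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(4)                  # placeholder state
mean, std = policy(state)
dist = torch.distributions.Normal(mean, std)

action = dist.sample()                  # sampling happens outside the network
log_prob = dist.log_prob(action).sum()  # log pi(a|s, theta), differentiable w.r.t. theta

G = 1.0                                 # placeholder return observed after taking the action
loss = -log_prob * G                    # ascend the policy gradient: grad log pi * return
optimizer.zero_grad()
loss.backward()
optimizer.step()
```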

Thus, it's not obvious how my action can be something continuous like "turn +19.2345 degrees clockwise". Such an action must have already been pre-defined to the value "19.2345", right?

What the policy might output here is the two parameters of the distribution $\mathcal{N}(\mu, \sigma)$, from which you then sample to get an action like "turn x degrees clockwise". So for example, the neural network could output $(25, 7)$, and additional code will interpret those values as describing the distribution and take a sample. With a mean of 25 and a standard deviation of 7, at the point you select the action you could get "turn +19.2345 degrees clockwise" amongst a range of other values. The value 19.2345 does not need to be pre-defined or represented in the neural network in order to do that.
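As a purely illustrative calculation (NumPy here, with the $(25, 7)$ values from the example above), a single draw from $\mathcal{N}(25, 7)$ could indeed land on exactly such an angle:

```python
import numpy as np

rng = np.random.default_rng()
mu, sigma = 25.0, 7.0            # parameters output by the policy network
angle = rng.normal(mu, sigma)    # one draw, e.g. 19.2345 on some runs
print(f"turn {angle:+.4f} degrees clockwise")
```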

Correct answer by Neil Slater on December 20, 2020
