Asked by sirfroggy on December 16, 2021
It is proved that the Bellman update \eqref{1} is a contraction.
Here is the Bellman update that is used for Q-learning:
$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha \big( r(s, a, s') + \gamma \max_{a^*} Q_{t}(s', a^*) - Q_t(s, a) \big) \tag{1}\label{1}$$
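To make \eqref{1} concrete, here is a minimal sketch of the tabular update (the function and variable names such as `q_learning_update`, `alpha`, and `gamma` are just illustrative, not from any particular library):

```python
import numpy as np

# A minimal sketch of the tabular update in (1): Q is an |S| x |A| array,
# (s, a, r, s_next) is one observed transition, alpha is the step size,
# and gamma is the discount factor.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])   # r(s, a, s') + gamma * max_{a*} Q_t(s', a*)
    Q[s, a] += alpha * (target - Q[s, a])    # Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a))
    return Q
```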
The proof that \eqref{1} is a contraction relies on, among other facts, the one relevant to this question: the max operation is non-expansive; that is,
$$\lvert \max_a f(a) - \max_a g(a) \rvert \leq \max_a \lvert f(a) - g(a) \rvert \tag{2}\label{2}$$
This is also proved in a lot of places and it is pretty intuitive.
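For completeness, here is one standard way to see \eqref{2} (assuming without loss of generality that $\max_a f(a) \ge \max_a g(a)$, and writing $a^\dagger$ for a maximizer of $f$):

$$\max_a f(a) - \max_a g(a) = f(a^\dagger) - \max_a g(a) \le f(a^\dagger) - g(a^\dagger) \le \max_a \lvert f(a) - g(a) \rvert.$$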
Consider the following Bellman update:
$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha \big( r(s, a, s') + \gamma \, \mathrm{SAMPLE}_{a^*} Q_{t}(s', a^*) - Q_t(s, a) \big) \tag{3}\label{3}$$
where $\mathrm{SAMPLE}_a(Q(s, a))$ samples an action in that state with probability proportional to the actions' Q values, and returns the Q value of the sampled action.
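Here is a minimal sketch of what I mean by $\mathrm{SAMPLE}$, assuming a tabular NumPy Q-table with nonnegative Q values (for negative values some transformation such as a softmax would be needed, which is an extra assumption on my part):

```python
import numpy as np

# Sketch of SAMPLE_{a*}(Q(s', a*)): sample an action in s' with probability
# proportional to its Q value and return that action's Q value.
# Assumes the entries of Q[s_next] are nonnegative and not all zero;
# otherwise a softmax over Q[s_next] would be one way to form probabilities.
def sample_backup_value(Q, s_next, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    q_values = Q[s_next]
    probs = q_values / q_values.sum()                 # weight actions by their Q values
    a_sampled = rng.choice(len(q_values), p=probs)    # sample a* ~ probs
    return Q[s_next, a_sampled]                       # used in place of max_{a*} Q_t(s', a*) in (3)
```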
Is this new Bellman operation still a contraction?
Is the SAMPLE operation non-expansive? It is, of course, possible to generate individual samples that will not satisfy equation \eqref{2}. What I am asking is whether it is non-expansive in expectation.
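As a concrete illustration of the first point (an example of my own): with two actions, take $f = (3, 0)$ and $g = (2, 1)$, so $\max_a \lvert f(a) - g(a) \rvert = 1$; if the sample for $f$ happens to pick the first action (value $3$) and the sample for $g$ picks the second action (value $1$), the realized difference is $2 > 1$, so a single pair of samples can indeed violate \eqref{2}.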
My approach is:
$$\lvert\, \mathbb{E}_{a \sim Q}[f(a)] - \mathbb{E}_{a \sim Q}[g(a)] \,\rvert \leq \mathbb{E}_{a \sim Q}\, \lvert\, f(a) - g(a) \,\rvert \tag{4}\label{4}$$
Equivalently:
$$\lvert\, \mathbb{E}_{a \sim Q}[f(a) - g(a)] \,\rvert \leq \mathbb{E}_{a \sim Q}\, \lvert\, f(a) - g(a) \,\rvert$$
\eqref{4} is true since:
$$\lvert\, \mathbb{E}[X] \,\rvert \leq \mathbb{E}\, \lvert X \rvert$$
However, I am not sure whether proving \eqref{4} proves the theorem. Do you think this is a legitimate proof that \eqref{3} is a contraction?
(If so, this would mean that stochastic-policy Q-learning theoretically converges, and that we can have stochastic policies with regular Q-learning; this is why I am interested.)
Both intuitive answers and mathematical proofs are welcome.