Asked by sirfroggy on December 16, 2021
It is proved that the Bellman update \eqref{1} is a contraction.
Here is the Bellman update that is used for Q-learning:
$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha \big( r(s, a, s') + \gamma \max_{a^*} Q_{t}(s', a^*) - Q_t(s, a) \big) \tag{1}\label{1}$$
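To make \eqref{1} concrete, here is a minimal sketch of the tabular update (the function and variable names such as `q_learning_update`, `alpha`, and `gamma` are just illustrative, not from any particular library):

```python
import numpy as np

# A minimal sketch of the tabular update in (1): Q is an |S| x |A| array,
# (s, a, r, s_next) is one observed transition, alpha is the step size,
# and gamma is the discount factor.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])   # r(s, a, s') + gamma * max_{a*} Q_t(s', a*)
    Q[s, a] += alpha * (target - Q[s, a])    # Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a))
    return Q
```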
The proof that \eqref{1} is a contraction relies on, among other facts, the one relevant to this question: the max operation is non-expansive; that is,
$$\lvert \max_a f(a) - \max_a g(a) \rvert \leq \max_a \lvert f(a) - g(a) \rvert \tag{2}\label{2}$$
This is also proved in a lot of places and it is pretty intuitive.
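For completeness, here is one standard way to see \eqref{2} (assuming without loss of generality that $\max_a f(a) \ge \max_a g(a)$, and writing $a^\dagger$ for a maximizer of $f$):

$$\max_a f(a) - \max_a g(a) = f(a^\dagger) - \max_a g(a) \le f(a^\dagger) - g(a^\dagger) \le \max_a \lvert f(a) - g(a) \rvert.$$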
Consider the following Bellman update:
$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha \big( r(s, a, s') + \gamma \, \mathrm{SAMPLE}_{a^*} Q_{t}(s', a^*) - Q_t(s, a) \big) \tag{3}\label{3}$$
where $\mathrm{SAMPLE}_a(Q(s, a))$ samples an action in that state with probability proportional to the actions' Q values, and returns the Q value of the sampled action.
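Here is a minimal sketch of what I mean by $\mathrm{SAMPLE}$, assuming a tabular NumPy Q-table with nonnegative Q values (for negative values some transformation such as a softmax would be needed, which is an extra assumption on my part):

```python
import numpy as np

# Sketch of SAMPLE_{a*}(Q(s', a*)): sample an action in s' with probability
# proportional to its Q value and return that action's Q value.
# Assumes the entries of Q[s_next] are nonnegative and not all zero;
# otherwise a softmax over Q[s_next] would be one way to form probabilities.
def sample_backup_value(Q, s_next, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    q_values = Q[s_next]
    probs = q_values / q_values.sum()                 # weight actions by their Q values
    a_sampled = rng.choice(len(q_values), p=probs)    # sample a* ~ probs
    return Q[s_next, a_sampled]                       # used in place of max_{a*} Q_t(s', a*) in (3)
```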
Is this new Bellman operation still a contraction?
Is the SAMPLE operation non-expansive? It is, of course, possible to generate individual samples that will not satisfy equation \eqref{2}. What I am asking is whether it is non-expansive in expectation.
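As a concrete illustration of the first point (an example of my own): with two actions, take $f = (3, 0)$ and $g = (2, 1)$, so $\max_a \lvert f(a) - g(a) \rvert = 1$; if the sample for $f$ happens to pick the first action (value $3$) and the sample for $g$ picks the second action (value $1$), the realized difference is $2 > 1$, so a single pair of samples can indeed violate \eqref{2}.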
My approach is:
$$\lvert\, \mathbb{E}_{a \sim Q}[f(a)] - \mathbb{E}_{a \sim Q}[g(a)] \,\rvert \leq \mathbb{E}_{a \sim Q}\, \lvert\, f(a) - g(a) \,\rvert \tag{4}\label{4}$$
Equivalently:
$$\lvert\, \mathbb{E}_{a \sim Q}[f(a) - g(a)] \,\rvert \leq \mathbb{E}_{a \sim Q}\, \lvert\, f(a) - g(a) \,\rvert$$
\eqref{4} is true since:
$$\lvert\, \mathbb{E}[X] \,\rvert \leq \mathbb{E}\, \lvert X \rvert$$
However, I am not sure whether proving \eqref{4} proves the theorem. Do you think this is a legitimate proof that \eqref{3} is a contraction?
(If so, this would mean that stochastic-policy Q-learning theoretically converges, and that we can have stochastic policies with regular Q-learning; this is why I am interested.)
Both intuitive answers and mathematical proofs are welcome.