Revisiting the Softmax Bellman Operator: New Benefits and New Perspective
Preprint (2018). DOI: 10.48550/arxiv.1812.00456

Abstract: The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better under…
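For context, the operator under study replaces the hard max in the Bellman backup with a Boltzmann (softmax) weighting over next-state action values. A minimal NumPy sketch of that backup, with illustrative values for the inverse temperature beta and discount gamma (chosen here, not taken from the paper):

```python
import numpy as np

def softmax_backup(q_next, beta):
    """Boltzmann-weighted average of next-state action values.

    Replaces the hard max of the standard Bellman backup; as beta grows,
    the weighting concentrates on the maximizing action.
    """
    q_next = np.asarray(q_next, dtype=np.float64)
    w = np.exp(beta * (q_next - q_next.max()))  # subtract max for numerical stability
    w /= w.sum()
    return float(np.dot(w, q_next))

def softmax_bellman_target(reward, q_next, gamma=0.99, beta=5.0):
    """One-step target r + gamma * softmax_beta(Q(s', .)) for a single transition."""
    return reward + gamma * softmax_backup(q_next, beta)

# The softmax target sits below the hard-max target, which is the
# sub-optimality the abstract refers to.
print(softmax_bellman_target(1.0, [0.2, 0.9, 0.5]))  # < 1.0 + 0.99 * 0.9
```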

Cited by 11 publications (11 citation statements). References 7 publications (19 reference statements).

“…Therefore, the DBS operator rectifies the convergence issue of the Boltzmann softmax operator with fixed parameters. Note that we also achieve a tighter error bound for the fixed-parameter softmax operator in general cases compared with Song et al. (2018). In addition, we show that the DBS operator achieves a good convergence rate.…”
Section: Introduction (mentioning; confidence: 61%)
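The distinction drawn in the statement above can be seen numerically: with a fixed beta the Boltzmann softmax keeps a constant gap to the hard max, while a schedule in which beta grows without bound drives that gap to zero. A small illustrative sketch (the schedule beta_t = t is a placeholder, not the schedule used in the citing work):

```python
import numpy as np

def boltzmann_softmax(values, beta):
    """boltz_beta(x) = sum_i x_i * exp(beta * x_i) / sum_j exp(beta * x_j)."""
    values = np.asarray(values, dtype=np.float64)
    w = np.exp(beta * (values - values.max()))  # subtract max for numerical stability
    return float(np.dot(w / w.sum(), values))

values = [0.2, 0.9, 0.5]
# A fixed beta keeps a constant gap to max(values); an increasing schedule closes it.
print(boltzmann_softmax(values, 1.0))               # fixed beta: noticeable gap
for t in (1, 10, 100, 1000):
    print(t, boltzmann_softmax(values, float(t)))   # placeholder schedule beta_t = t
```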
“…In this section, we compare the error bound in Corollary 2 with that in Song et al. (2018), which studies the error bound of the softmax operator with a fixed parameter β.…”
Section: Relation To Existing Results (mentioning; confidence: 99%)
“…These layers are implemented using ReLU activation functions with a linear output layer. For both the DQNEnsemble-FSO/RF and DQN-FSO/RF agents, actions are selected using the Boltzmann policy [41]. Our evaluations use the Adam optimizer to minimize the loss function given in Equation 17.…”
Section: B. Evaluation Set-up (mentioning; confidence: 99%)
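The Boltzmann policy mentioned in that statement samples actions in proportion to exponentiated Q-values rather than always taking the argmax. A minimal sketch of that selection rule (the temperature value is illustrative; the cited agents' actual hyperparameters are not given here):

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = rng if rng is not None else np.random.default_rng()
    q = np.asarray(q_values, dtype=np.float64)
    logits = (q - q.max()) / temperature  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

# Action 1 has the highest Q-value and is picked most often, but the policy
# still explores the other actions with non-zero probability.
print(boltzmann_policy([0.2, 0.9, 0.5], temperature=0.5))
```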