2021
DOI: 10.48550/arxiv.2111.02997
Preprint

Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch

Abstract: In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step …
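To make the setting in the abstract concrete, here is a minimal Python sketch of a tabular softmax off-policy actor critic that applies no density-ratio correction to the state distribution. It is an illustration under stated assumptions, not the paper's algorithm: the random MDP, the uniform behavior policy, the expected-SARSA style critic, the step sizes, and the function name `run_off_policy_ac` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def run_off_policy_ac(n_states=4, n_actions=2, steps=5000, gamma=0.9,
                      alpha=0.1, beta=0.05, seed=0):
    """Illustrative tabular off-policy actor critic (assumed setup, not the paper's)."""
    rng = np.random.default_rng(seed)
    # Hypothetical random MDP: P[s, a] is a distribution over next states.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    theta = np.zeros((n_states, n_actions))  # actor: softmax logits per state
    Q = np.zeros((n_states, n_actions))      # critic: tabular action values

    s = 0
    for _ in range(steps):
        # Behavior policy (assumed uniform) picks the action, so visited
        # states follow the behavior policy's state distribution.
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])

        # Critic: bootstrap with the target policy's expected value at s_next.
        pi_next = softmax(theta[s_next])
        td_target = R[s, a] + gamma * (pi_next @ Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])

        # Actor: gradient of sum_a pi(a|s) Q(s, a) with respect to theta[s],
        # applied at the sampled state with no density-ratio reweighting.
        pi_s = softmax(theta[s])
        theta[s] += beta * pi_s * (Q[s] - pi_s @ Q[s])

        s = s_next

    policy = np.array([softmax(theta[i]) for i in range(n_states)])
    return policy, Q
```

Calling `policy, Q = run_off_policy_ac()` returns the learned softmax policy, one row per state, together with the critic's table. The actor here uses an approximate, sample-based step driven by the learned critic rather than the exact policy gradient, which is the kind of update the abstract contrasts with prior analyses.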

Citations: cited by 1 publication (1 citation statement)
References: 37 publications
“…Non-asymptotic analyses for critic only methods have been extensively studied recently, e.g., TD Lakshminarayanan & Szepesvari, 2018;Bhandari et al, 2018;Cai et al, 2019;Sun et al, 2019;, SARSA (Zou et al, 2019), gradient TD (GTD) method (Dalal et al, 2018;Xu et al, 2019;Wang et al, 2021;2017;Liu et al, 2015;Gupta et al, 2019;Kaledin et al, 2020;Ma et al, 2020;Wang & Zou, 2020). There are also non-asymptotic analyses for actor only method, e.g., (Bhandari & Russo, 2021;Agarwal et al, 2021;Mei et al, 2020;Li et al, 2021a;Laroche & des Combes, 2021;Zhang et al, 2021;Cen et al, 2021;Zhang et al, 2020a;Lin, 2022). In this paper, we focus on AC and NAC algorithms, where how the errors in the actor and the critic affects the other needs to be analyzed.…”
Section: Related Work (mentioning)
confidence: 99%