2019
DOI: 10.48550/arxiv.1911.06854

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

Cited by 33 publications (51 citation statements)
References 21 publications
“…For our experiments, we utilize the environments and implementations of baseline estimators in the Caltech OPE Benchmarking Suite (COBS) [Voloshin et al., 2019]. In this section, we present results on the Graph and Toy Mountain Car environments.…”
Section: Results (mentioning)
confidence: 99%
“…Notice that in the n-step q-estimate, returns are backed up from possible future outcomes, whereas in the n-step interpolation estimators the probabilities are 'backed up' from the possible histories. (In the diagram, the bias-variance characterization of PDIS and SIS is based on typical practical observations [Voloshin et al., 2019, Fu et al., 2021]; however, it is worth noting that SIS is not biased when oracle density ratios are available, and there are also edge cases, particularly for short-horizon problems, where SIS can have higher variance than PDIS [Liu et al., 2020, Metelli et al., 2020].)…”
Section: Combining Trajectory-based and Density-based Importance Sampling (mentioning)
confidence: 99%
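To make the PDIS estimator referenced above concrete, here is a minimal sketch of per-decision importance sampling over logged trajectories. The function and argument names (`pi_e`, `pi_b`, `trajectories`, `gamma`) are hypothetical placeholders, not the cited papers' implementation.

```python
# Minimal PDIS sketch (illustrative, not from the cited works).
import numpy as np

def pdis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Average per-decision importance-sampling return.

    Each trajectory is a list of (state, action, reward) tuples logged under
    the behavior policy; pi_e(a, s) and pi_b(a, s) return action probabilities
    under the evaluation and behavior policies respectively.
    """
    returns = []
    for traj in trajectories:
        rho = 1.0  # running product of importance ratios up to step t
        g = 0.0    # discounted, reweighted return of this trajectory
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)
            g += (gamma ** t) * rho * r
        returns.append(g)
    return float(np.mean(returns))
```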
“…(10) can be more flexible to handle arbitrary initial state-action pairs. And value-based methods, such as Fitted Q-Evaluation (FQE) (e.g., Voloshin et al., 2019), though empirically better than density-based ones (Fu et al., 2021), usually cannot handle multiple reward functions simultaneously.…”
Section: More Related Work (mentioning)
confidence: 99%
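For reference, Fitted Q-Evaluation repeatedly regresses Q onto one-step Bellman targets computed with the evaluation policy's actions. Below is a minimal tabular sketch under assumed inputs: `dataset` holds logged (s, a, r, s_next, done) tuples and `pi_e` maps a state to the evaluation policy's action; these names are hypothetical and the sketch is not the cited implementation.

```python
# Minimal tabular FQE sketch (illustrative assumptions only).
import numpy as np

def fqe_tabular(dataset, pi_e, n_states, n_actions, gamma=0.99, iters=100):
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in dataset:
            # Bellman target uses the evaluation policy's action at s_next.
            target = r if done else r + gamma * q[s_next, pi_e(s_next)]
            targets[s, a] += target
            counts[s, a] += 1
        # Average targets per (s, a); keep old values for unseen pairs.
        seen = counts > 0
        q = np.where(seen, targets / np.maximum(counts, 1), q)
    return q
```

The estimated value of the evaluation policy then follows by querying q at the initial state(s) with the action chosen by pi_e.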