2019
DOI: 10.48550/arxiv.1911.06854

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

Cited by 33 publications (51 citation statements)
References 21 publications
“…For our experiments, we utilize the environments and implementations of baseline estimators in the Caltech OPE Benchmarking Suite (COBS) [Voloshin et al., 2019]. In this section, we present results on the Graph and Toy Mountain Car environments.…”
Section: Results (mentioning)
confidence: 99%
“…Notice that in the n-step q-estimate, returns are backed up from possible future outcomes, whereas in the n-step interpolation estimators the probabilities are 'backed up' from the possible histories. (In the diagram, the bias-variance characterization of PDIS and SIS is based on typical practical observations [Voloshin et al., 2019, Fu et al., 2021]; however, it is worth noting that SIS is not biased when oracle density ratios are available, and there are also edge cases, particularly for short-horizon problems, where SIS can have higher variance than PDIS [Liu et al., 2020, Metelli et al., 2020].)…”
Section: Combining Trajectory-based and Density-based Importance Sampling (mentioning)
confidence: 99%
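To make the PDIS estimator referenced above concrete, here is a minimal sketch of per-decision importance sampling over logged trajectories. The function and argument names (`pi_e`, `pi_b`, `trajectories`, `gamma`) are hypothetical placeholders, not the cited papers' implementation.

```python
# Minimal PDIS sketch (illustrative, not from the cited works).
import numpy as np

def pdis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Average per-decision importance-sampling return.

    Each trajectory is a list of (state, action, reward) tuples logged under
    the behavior policy; pi_e(a, s) and pi_b(a, s) return action probabilities
    under the evaluation and behavior policies respectively.
    """
    returns = []
    for traj in trajectories:
        rho = 1.0  # running product of importance ratios up to step t
        g = 0.0    # discounted, reweighted return of this trajectory
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)
            g += (gamma ** t) * rho * r
        returns.append(g)
    return float(np.mean(returns))
```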
“…(10) can be more flexible to handle arbitrary initial state-action pairs. And value-based methods, such as Fitted Q-Evaluation (FQE) (e.g., Voloshin et al., 2019), though empirically better than density-based ones (Fu et al., 2021), usually cannot handle multiple reward functions simultaneously.…”
Section: More Related Work (mentioning)
confidence: 99%
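For reference, Fitted Q-Evaluation repeatedly regresses Q onto one-step Bellman targets computed with the evaluation policy's actions. Below is a minimal tabular sketch under assumed inputs: `dataset` holds logged (s, a, r, s_next, done) tuples and `pi_e` maps a state to the evaluation policy's action; these names are hypothetical and the sketch is not the cited implementation.

```python
# Minimal tabular FQE sketch (illustrative assumptions only).
import numpy as np

def fqe_tabular(dataset, pi_e, n_states, n_actions, gamma=0.99, iters=100):
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in dataset:
            # Bellman target uses the evaluation policy's action at s_next.
            target = r if done else r + gamma * q[s_next, pi_e(s_next)]
            targets[s, a] += target
            counts[s, a] += 1
        # Average targets per (s, a); keep old values for unseen pairs.
        seen = counts > 0
        q = np.where(seen, targets / np.maximum(counts, 1), q)
    return q
```

The estimated value of the evaluation policy then follows by querying q at the initial state(s) with the action chosen by pi_e.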