Off-Policy Differentiable Logic Reinforcement Learning

Zhang, Li; Li, Xin; Wang, Mingzhong; Tian, Andong

doi:10.1007/978-3-030-86520-7_38

Cited by 15 publications

(19 citation statements)

References 12 publications


“…Many estimators for G φ H have been proposed [11] [14], such as the per-trajectory IS estimator, per-step estimator, weighted estimator, and doubly robust estimator [15] [16]. However, these estimators cannot be applied directly because in the node dropout setting the state-action to state transition terms must be handled appropriately.…”

Section: B Transformed Policy Importance Sampling (mentioning)

confidence: 99%

Corrected: On Confident Policy Evaluation for Factored Markov Decision Processes with Node Dropouts

Fiscko, Kar, Sinopoli

2023

Preprint


In this work we investigate an importance sampling approach for evaluating policies for a structurally time-varying factored Markov decision process (MDP), i.e. the policy's value is estimated with a high-probability confidence interval. In particular, we begin with a multi-agent MDP controlled by a known policy but with unknown transition dynamics. One agent is then removed from the system (i.e. the system experiences node dropout), forming a new MDP of the remaining agents, with a new state space, action space, and new transition dynamics. We assume that the effect of removing an agent corresponds to the marginalization of its factor in the transition dynamics. The reward function may likewise be marginalized, or it may be entirely redefined for the new system. Robust policy importance sampling is then used to evaluate candidate policies for the new system, and estimated values are presented with probabilistic confidence bounds. This computation is completed with no observations of the new system, meaning that a safe policy may be found before dropout occurs. The utility of this approach is demonstrated in simulation and compared to Monte Carlo simulation of the new system.
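The citation statement for this preprint names the per-trajectory and per-step importance sampling (IS) estimators. As a rough illustration only, the sketch below shows the textbook forms of those two estimators for a finite-horizon MDP. It is not the transformed estimator of the preprint, which states that the standard forms cannot be applied directly under node dropout. The function names and the trajectory layout, a list of `(ratio, reward)` pairs with `ratio` equal to the target policy's action probability divided by the behavior policy's, are assumptions made for this example.

```python
def per_trajectory_is(trajectories, gamma=1.0):
    """Per-trajectory IS: weight the whole return by the product of all ratios.

    Each trajectory is a list of (ratio, reward) pairs (hypothetical layout),
    where ratio = pi_target(a_t | s_t) / pi_behavior(a_t | s_t).
    """
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (ratio, reward) in enumerate(traj):
            weight *= ratio              # product over the full horizon
            ret += gamma ** t * reward   # discounted return of the trajectory
        total += weight * ret
    return total / len(trajectories)


def per_step_is(trajectories, gamma=1.0):
    """Per-step IS: weight each reward by the ratio product up to its own step."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (ratio, reward) in enumerate(traj):
            weight *= ratio              # product over steps 0..t only
            total += gamma ** t * weight * reward
    return total / len(trajectories)


if __name__ == "__main__":
    # Two hand-made trajectories of horizon 2 (illustrative numbers only).
    data = [[(2.0, 1.0), (0.5, 1.0)], [(1.0, 0.0), (1.0, 1.0)]]
    print(per_trajectory_is(data), per_step_is(data))
```

The per-step form drops ratios from later time steps when weighting an earlier reward, which is the usual reason it tends to have lower variance than the per-trajectory form.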


“…$\sum_{h=0}^{H-1} V(\hat{J}_h(\pi))$. However, we can still compute the variance and a variance estimator via a recursive form (Jiang and Li, 2016).…”

Section: Extension To RL (mentioning)

confidence: 99%

“…OPE has been used successfully for many real world systems, such as recommendation systems (Li et al, 2011) and digital marketing (Thomas et al, 2017), to select a good policy to be deployed in the real world. A variety of estimators have been proposed, particularly based on importance sampling (IS) (Hammersley and Handscomb, 1964) and modifications that reduce variance, such as self-normalization (Swaminathan and Joachims, 2015b), direct methods that use reward models, and variance reduction techniques like the doubly robust (DR) estimator (Dudík et al, 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016). Often high-confidence estimation is key, with the goal to estimate confidence intervals around these value estimates that maintain coverage without being too loose (Thomas et al, 2015a,b; Swaminathan and Joachims, 2015a; Kuzborskij et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments

Liu, Chandak, Thomas, et al.

2023

Preprint


In this work, we consider the off-policy policy evaluation problem for contextual bandits and finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical for policy evaluation, but existing estimators that reuse old data introduce large bias such that we cannot obtain a valid confidence interval. Inspired by a related field called survey sampling, we introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias. The estimator unifies several existing off-policy policy evaluation methods and improves on them with the use of auxiliary information and a regression approach. We prove that the new estimator is asymptotically unbiased, and provide a consistent variance estimator to construct a large-sample confidence interval. Finally, we empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.


“…The IPS estimator often faces a high variance [7], which can be reduced by a self-normalized inverse propensity scoring (SNIPS) estimator [30]. Furthermore, the Doubly Robust (DR) estimator [11,38] is proposed to simultaneously consider imputation errors and propensities in a doubly robust manner, reducing the high variance in IPS.…”

Section: Recommendation With Selection Bias (mentioning)

confidence: 99%

“…The robustness and accuracy of the inverse probability estimation is the key to the counterfactual learning for the recommendation systems. The imputation errors and propensities are simultaneously considered in a doubly robust way for recommendation on MNAR [38] and reinforcement learning [11].…”

Section: Introduction (mentioning)

confidence: 99%

Uncertainty Calibration for Counterfactual Propensity Estimation in Recommendation

Sun, Liu, Wu

2023

Preprint


In recommendation systems, a large portion of the ratings are missing due to selection biases, which is known as Missing Not At Random. Counterfactual inverse propensity scoring (IPS) has been used to weight the imputation error of every observed rating. Although effective in multiple scenarios, we argue that the performance of IPS estimation is limited due to the uncertainty miscalibration of propensity estimation. In this paper, we propose uncertainty calibration for the propensity estimation in recommendation systems with multiple representative uncertainty calibration techniques. Theoretical analysis on the bias and generalization bound shows the superiority of the calibrated IPS estimator over the uncalibrated one. Experimental results on the Coat and Yahoo datasets show that the uncertainty calibration is improved and hence brings better recommendation results.
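The citation statements for this preprint contrast the IPS, SNIPS and doubly robust (DR) estimators. The sketch below gives one common textbook form of each for a logged bandit-style dataset, as a hedged illustration rather than the method of any paper listed here. The two-action toy problem, the probabilities, the deliberately imperfect reward model `q`, and all function names are assumptions made for this example.

```python
import random


def ips(rewards, weights):
    """Inverse propensity scoring: average of weight * reward."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips(rewards, weights):
    """Self-normalized IPS: normalize by the sum of weights instead of n."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def doubly_robust(rewards, weights, q_logged, v_target):
    """DR: model-based value plus an importance-weighted residual correction.

    q_logged[i] is the reward model's prediction for the logged action,
    v_target[i] is the model's expected reward under the target policy.
    """
    terms = (v + w * (r - q)
             for v, w, r, q in zip(v_target, weights, rewards, q_logged))
    return sum(terms) / len(rewards)


def simulate(n=5000, seed=0):
    """Toy two-action problem (all numbers are illustrative assumptions)."""
    rng = random.Random(seed)
    p_behavior = {0: 0.8, 1: 0.2}   # logging policy
    p_target = {0: 0.2, 1: 0.8}     # policy being evaluated; true value is 0.8
    actions = [1 if rng.random() < p_behavior[1] else 0 for _ in range(n)]
    rewards = [float(a) for a in actions]            # action 1 pays 1, action 0 pays 0
    weights = [p_target[a] / p_behavior[a] for a in actions]
    q = {0: 0.0, 1: 0.9}                             # imperfect reward model
    q_logged = [q[a] for a in actions]
    v_target = [sum(p_target[a] * q[a] for a in (0, 1))] * n
    return {
        "ips": ips(rewards, weights),
        "snips": snips(rewards, weights),
        "dr": doubly_robust(rewards, weights, q_logged, v_target),
    }


if __name__ == "__main__":
    print(simulate())
```

In this toy setting all three estimates should land near the true target value of 0.8. The DR correction term vanishes wherever the reward model is exact, which is one way to see where its variance reduction comes from.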
