“…OPE has been used successfully for many real world systems, such as recommendation systems (Li et al, 2011) and digital marketing (Thomas et al, 2017), to select a good policy to be deployed in the real world. A variety of estimators have been proposed, particularly based on importance sampling (IS) (Hammersley and Handscomb, 1964) reduce variance, such as self-normalization (Swaminathan and Joachims, 2015b), direct methods that use reward models and variance reduction techniques like the doubly robust (DR) estimator (Dudík et al, 2011;Jiang and Li, 2016;Thomas and Brunskill, 2016). Often high-confidence estimation is key, with the goal to estimate confidence intervals around these value estimates that maintain coverage without being too loose (Thomas et al, 2015a,b;Swaminathan and Joachims, 2015a;Kuzborskij et al, 2021).…”