Temporal difference learning (TD) is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite-time convergence rates for TD learning with linear function approximation. The analysis rests on a key insight: a rigorous connection between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent. Using an information-theoretic technique, the authors also provide results for the case when TD is applied to a single Markovian data stream, where the algorithm's updates can be severely biased. Their analysis extends seamlessly to the study of TD learning with eligibility traces and Q-learning for high-dimensional optimal stopping problems.
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite-time analysis of temporal difference learning with linear function approximation. Apart from a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Final sections of the paper show how all of our main results extend to the study of TD learning with eligibility traces, known as TD(λ), and to Q-learning applied in high-dimensional optimal stopping problems.
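To make the algorithm being analyzed concrete, here is a minimal sketch of TD(0) with linear function approximation on a toy two-state Markov reward process. The toy chain, rewards, feature map, and stepsize schedule are all illustrative choices, not taken from the paper; with identity features the method reduces to tabular TD, so the iterate should approach the true value function.

```python
import numpy as np

# Illustrative TD(0) sketch with linear value approximation
# V(s) ~ phi(s) @ theta on a toy 2-state Markov reward process.
# The chain, rewards, and stepsizes are assumptions for this demo.

rng = np.random.default_rng(0)

n_states = 2
gamma = 0.9
P = np.array([[0.5, 0.5], [0.5, 0.5]])   # transition probabilities
r = np.array([1.0, 0.0])                  # expected reward in each state
phi = np.eye(n_states)                    # identity features => tabular TD

# True values solve the Bellman equation V = r + gamma * P @ V.
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

theta = np.zeros(n_states)
s = 0
for t in range(200_000):
    s_next = rng.choice(n_states, p=P[s])
    alpha = 100.0 / (100.0 + t)           # decaying (Robbins-Monro) stepsize
    # TD error: one-step bootstrapped target minus current estimate.
    td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    # The TD update resembles a stochastic gradient step along phi(s).
    theta = theta + alpha * td_error * phi[s]
    s = s_next

print(V_true)   # [5.5 4.5]
print(theta)    # approaches V_true
```

The resemblance of the update to a stochastic gradient step (a stepsize times a noisy descent-like direction) is what lets the analysis mirror the stochastic gradient descent literature.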
One of the central challenges in online advertising is attribution, namely, assessing the contribution of individual advertiser actions, such as emails, display ads, and search ads, to eventual conversion. Several heuristics are used for attribution in practice; however, most lack any formal justification. The main contribution of this work is an axiomatic framework for attribution in online advertising. We show that the most common heuristics can be cast within the framework and illustrate how they may fail. We propose a novel attribution metric, which we refer to as the counterfactual adjusted Shapley value (CASV), which inherits the desirable properties of the traditional Shapley value while overcoming its shortcomings in the online advertising context. We also propose a Markovian model for the user journey through the conversion funnel, in which ad actions may have disparate impacts at different stages. We use the Markovian model to compare our metric with commonly used metrics. Furthermore, under the Markovian model, we establish that the CASV metric coincides with an adjusted “unique-uniform” attribution scheme. This scheme is efficiently implementable and can be interpreted as a correction to the commonly used uniform attribution scheme. We supplement our theoretical developments with numerical experiments using a real-world large-scale data set. This paper was accepted by David Simchi-Levi, revenue management and market analytics.
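For readers unfamiliar with the baseline the CASV metric builds on, here is a sketch of the classical Shapley value applied to ad-channel attribution: each channel is credited with its marginal contribution to conversion, averaged over all orderings of the channels. The three channels and the toy conversion-probability function below are assumptions for illustration; this is the traditional Shapley value, not the paper's counterfactual-adjusted variant.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley value: average each player's marginal
    contribution over all orderings of the players."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            after = value(frozenset(coalition))
            phi[p] += after - before
    return {p: phi[p] / len(perms) for p in players}

# Toy characteristic function: conversion probability achieved by
# each subset of ad channels (values are made up for the example).
v = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.1,
    frozenset({"display"}): 0.05,
    frozenset({"search"}): 0.2,
    frozenset({"email", "display"}): 0.2,
    frozenset({"email", "search"}): 0.35,
    frozenset({"display", "search"}): 0.3,
    frozenset({"email", "display", "search"}): 0.5,
}

attr = shapley_values(["email", "display", "search"], v.get)
print(attr)   # credits sum to v(all channels) = 0.5
```

By construction the credits sum exactly to the value of the full channel set (the efficiency axiom), which is one of the desirable properties the abstract refers to.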