2011
DOI: 10.1007/978-3-642-23808-6_1

Sparse Kernel-SARSA(λ) with an Eligibility Trace

Abstract: We introduce the first online kernelized version of SARSA(λ) to permit sparsification for arbitrary λ, 0 ≤ λ ≤ 1; this is possible via a novel kernelization of the eligibility trace that is maintained separately from the kernelized value function. This separation is crucial for preserving the functional structure of the eligibility trace when using sparse kernel projection techniques that are essential for memory efficiency and capacity control. The result is a simple and practical Kernel-SARSA(λ) …
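
To make the construction described in the abstract concrete, the following is a rough, illustrative Python sketch of an online kernel SARSA(λ) learner in which the action-value function and the eligibility trace are kept as separate coefficient vectors over a shared kernel dictionary, with an ALD-style novelty test for sparsification. The RBF kernel, all parameter names, and the ALD threshold are assumptions made for illustration; this is not the exact algorithm or notation of the paper.

```python
import numpy as np

def rbf_kernel(x, y, sigma=0.5):
    """Gaussian RBF kernel between two state-action feature vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

class SparseKernelSarsaLambda:
    """Illustrative online kernel SARSA(lambda) with a separately
    kernelized eligibility trace.

    Both Q and the trace are expansions over the same dictionary of stored
    state-action points, but with separate coefficient vectors (w and e),
    so the trace keeps its own functional form as the dictionary changes.
    """

    def __init__(self, gamma=0.99, lam=0.9, alpha=0.1, ald_nu=0.01, kernel=rbf_kernel):
        self.gamma, self.lam, self.alpha, self.ald_nu = gamma, lam, alpha, ald_nu
        self.kernel = kernel
        self.dictionary = []            # stored state-action points
        self.w = np.zeros(0)            # value-function coefficients
        self.e = np.zeros(0)            # eligibility-trace coefficients
        self.K_inv = np.zeros((0, 0))   # inverse kernel matrix of the dictionary

    def _k_vec(self, x):
        return np.array([self.kernel(x, z) for z in self.dictionary])

    def q_value(self, x):
        return float(self.w @ self._k_vec(x)) if self.dictionary else 0.0

    def _project_or_add(self, x):
        """ALD-style sparsification: add x to the dictionary only if it is
        not approximately linearly dependent on it in feature space;
        otherwise represent x by its projection coefficients."""
        if not self.dictionary:
            self.dictionary.append(x)
            self.w, self.e = np.zeros(1), np.zeros(1)
            self.K_inv = np.array([[1.0 / self.kernel(x, x)]])
            return np.array([1.0])
        k = self._k_vec(x)
        a = self.K_inv @ k                    # projection onto dictionary features
        residual = self.kernel(x, x) - k @ a  # ALD test statistic
        if residual > self.ald_nu:
            n = len(self.dictionary)
            self.dictionary.append(x)
            new_inv = np.zeros((n + 1, n + 1))
            new_inv[:n, :n] = self.K_inv + np.outer(a, a) / residual
            new_inv[:n, n] = new_inv[n, :n] = -a / residual
            new_inv[n, n] = 1.0 / residual
            self.K_inv = new_inv
            self.w = np.append(self.w, 0.0)
            self.e = np.append(self.e, 0.0)
            coeffs = np.zeros(n + 1)
            coeffs[n] = 1.0
            return coeffs
        return a

    def update(self, x, reward, x_next, terminal=False):
        """One on-policy TD(lambda) step for the transition x -> x_next,
        where x and x_next are state-action feature vectors."""
        q = self.q_value(x)
        q_next = 0.0 if terminal else self.q_value(x_next)
        coeffs = self._project_or_add(x)
        td_error = reward + self.gamma * q_next - q
        self.e = self.gamma * self.lam * self.e + coeffs   # kernelized trace update
        self.w = self.w + self.alpha * td_error * self.e   # value-function update
        return td_error
```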

Cited by 8 publications (8 citation statements); citing publications span 2012–2020.
References 13 publications (17 reference statements).

Citation statements, ordered by relevance:
“…Mountain Car is one of the domains from the standard benchmarks of the NIPS 2005 Workshop. These benchmarks are used to evaluate the performance of various reinforcement learning algorithms [1], [20]–[23], [26], [31]. Note that OSKTD is a method for prediction.…”
Section: Methods (mentioning)
confidence: 99%
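
For readers unfamiliar with the benchmark, a minimal way to run an episode of the Mountain Car domain today is through the Gymnasium port of the classic environment. The NIPS 2005 benchmark suite used its own interface, so the environment name and API below are assumptions about that modern port, and the random policy is only a placeholder for a learned one.

```python
import gymnasium as gym  # assumption: modern Gymnasium port of the classic domain

env = gym.make("MountainCar-v0")
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder policy; a learned SARSA(lambda) policy would go here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)
```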
“…2) Approach of Kernel Redefinition: Let us consider the elements β(s, s_i) and k(s, s_i) in the value function (31). Let…”
Section: Properties of the Selective Kernel-Based Value Function (mentioning)
confidence: 99%
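
Equation (31) of the citing paper is not reproduced in the excerpt. Purely as an illustration of the kind of value function the quote refers to, one in which each stored state s_i contributes through both a selective factor β(s, s_i) and a kernel k(s, s_i), a generic evaluation could look like the sketch below; the functional form, names, and weighting are assumptions, not the paper's definition.

```python
import numpy as np

def rbf(s, s_i, sigma=1.0):
    """Gaussian RBF kernel k(s, s_i)."""
    d = np.asarray(s, dtype=float) - np.asarray(s_i, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def selective_kernel_value(s, centers, alphas, kernel=rbf, selector=lambda s, c: 1.0):
    """Illustrative selective kernel value function: each stored state s_i
    contributes through a gate beta(s, s_i) (the selector) and a kernel
    k(s, s_i), weighted by a learned coefficient alpha_i."""
    return sum(a * selector(s, c) * kernel(s, c) for a, c in zip(alphas, centers))
```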
“…Studies of individual and organizational decision making document several types of failures to learn from real‐world feedback based on the gaps between what was expected and what actually occurred, or between what was achieved and what could have been achieved by better decisions (if this is known). Formal models of how to adaptively modify decision processes or decision rules to reduce regret—for example, by selecting actions next time a situation is encountered in a Markov decision process, or in a game against nature (with an unpredictable, possibly adversarial, environment) using probabilities that reflect cumulative regret for not having used each action in such situations in the past—require explicitly collecting and analyzing such data. Less formally, continually assessing the outcomes of decisions and how one might have done better, as required by the regret‐minimization framework, means that opportunities to learn from experience will more often be exploited instead of missed. Increase experimentation and adaptation.…”
Section: Doing Better: Using Predictable Rational Regret To Improve BCA (mentioning)
confidence: 99%
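
The rule described in this passage, choosing actions with probabilities that reflect cumulative regret for not having used each action, matches regret matching; a minimal sketch follows. It assumes the payoff of every action is observable after the fact, which fits the "game against nature" setting rather than the on-policy setting discussed in the next statement.

```python
import numpy as np

def regret_matching_probabilities(cumulative_regret):
    """Choose actions with probability proportional to positive cumulative
    regret; fall back to uniform if no action has positive regret."""
    positive = np.maximum(np.asarray(cumulative_regret, dtype=float), 0.0)
    total = positive.sum()
    return positive / total if total > 0.0 else np.full(len(positive), 1.0 / len(positive))

def update_cumulative_regret(cumulative_regret, payoffs, chosen_action):
    """Accumulate, per action, how much better it would have done than the
    action actually chosen (assumes all payoffs are observable afterwards)."""
    payoffs = np.asarray(payoffs, dtype=float)
    return np.asarray(cumulative_regret, dtype=float) + (payoffs - payoffs[chosen_action])
```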
“…In this case, formal models of regret reduction typically require exploring different decision rules to find out what works best. Such learning strategies (called “on‐policy” learning algorithms, since they learn only from experience with the policy actually used, rather than from information about what would have happened if something different had been tried) have been extensively developed and applied successfully to regret reduction in machine learning and game theory. They adaptively weed out the policies that are followed by the least desirable consequences, and increase the selection probabilities for policies that are followed by preferred consequences.…”
Section: Doing Better: Using Predictable Rational Regret To Improve BCA (mentioning)
confidence: 99%
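
As a small illustration of the on-policy idea in this passage, learning only from the consequences of the policy actually followed while shifting probability toward better-performing actions, here is a gradient-bandit style preference update. It is a generic sketch, not any of the specific algorithms the review cites, and reward_fn stands in for whatever feedback the environment returns for the chosen action.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def on_policy_step(prefs, baseline, reward_fn, step_size=0.1, baseline_rate=0.05):
    """One on-policy step: sample from the current policy, observe only the
    chosen action's outcome, and shift preferences toward actions that did
    better than a running baseline (and away from those that did worse)."""
    probs = softmax(prefs)
    action = int(rng.choice(len(prefs), p=probs))
    reward = reward_fn(action)          # feedback only for the action actually taken
    advantage = reward - baseline
    grad = -probs * advantage           # lower all preferences in proportion to their probability...
    grad[action] += advantage           # ...except the chosen action, which moves with its advantage
    new_baseline = baseline + baseline_rate * (reward - baseline)
    return prefs + step_size * grad, new_baseline
```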