2012
DOI: 10.1002/nav.21481
A least squares temporal difference actor–critic algorithm with applications to warehouse management

Abstract: This paper develops a new approximate dynamic programming algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor-critic type and uses a least squares temporal difference learning method. It operates on a sample-path of the system and optimizes the policy within a prespecified class parameterized by a parsimonious set of parameters. The method is applicable to a partially observable Markov decision process setting whe…
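The abstract describes an actor-critic scheme in which a critic estimates value-function coefficients by least squares temporal difference (LSTD) learning along a single sample path, while an actor adjusts a low-dimensional policy parameter. Below is a minimal, hypothetical sketch of such an update loop in Python; the toy MDP, feature maps, step sizes, and all function names (phi, psi, policy) are illustrative assumptions and are not taken from the paper.

# Minimal LSTD actor-critic sketch (illustrative only; not the authors' exact algorithm).
# Assumed setup: a made-up 2-state / 2-action average-cost MDP, a Boltzmann policy
# parameterized by theta, compatible critic features psi = grad_theta log(policy),
# and LSTD statistics (A, b) accumulated along a single simulated sample path.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
n = n_states * n_actions                       # dimensionality of theta

# Hypothetical transition kernel P[x, u, x'] and one-step cost c[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
c = np.array([[1.0, 0.5],
              [2.0, 0.1]])

def phi(x, u):
    """One-hot state-action feature vector."""
    f = np.zeros(n)
    f[x * n_actions + u] = 1.0
    return f

def policy(theta, x):
    """Boltzmann (softmax) action probabilities at state x."""
    prefs = np.array([theta @ phi(x, u) for u in range(n_actions)])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def psi(theta, x, u):
    """Compatible features: gradient of the log policy probability."""
    p = policy(theta, x)
    return phi(x, u) - sum(p[a] * phi(x, a) for a in range(n_actions))

theta = np.zeros(n)                            # actor (policy) parameters
A, b, z = np.zeros((n, n)), np.zeros(n), np.zeros(n)
avg_cost, lam, x = 0.0, 0.9, 0

for k in range(1, 20001):
    gamma_k, beta_k = 1.0 / k, 0.01 / k        # critic / actor step sizes
    u = rng.choice(n_actions, p=policy(theta, x))
    x_next = rng.choice(n_states, p=P[x, u])
    cost = c[x, u]
    avg_cost += gamma_k * (cost - avg_cost)    # running average-cost estimate

    # Critic (LSTD): update eligibility trace and sufficient statistics, then solve for r
    z = lam * z + psi(theta, x, u)
    u_next = rng.choice(n_actions, p=policy(theta, x_next))
    A += gamma_k * (np.outer(z, psi(theta, x, u) - psi(theta, x_next, u_next)) - A)
    b += gamma_k * (z * (cost - avg_cost) - b)
    r = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solve, robust to singular A

    # Actor: gradient step to reduce average cost, using the critic's Q estimate r @ psi
    theta -= beta_k * (r @ psi(theta, x, u)) * psi(theta, x, u)
    x = x_next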

Cited by 19 publications (8 citation statements); citing publications span 2013–2024. References 28 publications.

Citation statements
“…Hence, it cannot be guaranteed to obtain a global optimal solution. Convergence results [22,23] establish that it converges to a neighborhood of a stationary point of the expected average reward with probability one (w.p.1).…”
Section: Refining the Moth Control Policy
confidence: 98%
“…One approach to solve this problem is to use an actor-critic algorithm [21]. This paper uses a modified version of a Least-Squares Temporal Difference (LSTD) actor-critic algorithm developed in [22].…”
Section: Refining the Moth Control Policy
confidence: 99%
“…As a property of inner products, the linear approximation in (6) does not change the estimate of the gradient ∇ᾱ(θ) if the optimal coefficient r★ in (7) is used. Furthermore, the linear approximation reduces the complexity of learning from the space ℝ^{|X||U|} to the space ℝ^n, where n is the dimensionality of θ (Konda, 2002; Estanjini et al., 2012).…”
Section: Control Synthesis
confidence: 99%
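The statement above compresses an orthogonality argument that may be easier to see written out. The display below is a reconstruction under standard actor-critic notation assumed here, not quoted from the citing paper: η_θ is the stationary state-action distribution, ψ_θ = ∇ ln μ_θ(u|x) are the compatible features, and Q_θ is the (differential) state-action cost.

\[
\nabla \bar\alpha(\theta)
  = \sum_{x,u} \eta_\theta(x,u)\, Q_\theta(x,u)\, \psi_\theta(x,u)
  = \sum_{x,u} \eta_\theta(x,u)\, \bigl(r^{\star\top}\psi_\theta(x,u)\bigr)\, \psi_\theta(x,u),
\]
since the optimal $r^\star$ makes the residual $Q_\theta - r^{\star\top}\psi_\theta$ orthogonal to every component of $\psi_\theta$ under the inner product $\langle f, g\rangle_{\eta_\theta} = \sum_{x,u}\eta_\theta(x,u)\, f(x,u)\, g(x,u)$. Learning the coefficient $r \in \mathbb{R}^n$ therefore replaces learning $Q_\theta \in \mathbb{R}^{|X|\,|U|}$.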
“…Algorithm 1 learns the critic parameters using an LSTD method, which has been shown to be superior to other stochastic learning methods in terms of the convergence rate (Konda and Tsitsiklis, 2003; Boyan, 1999). Estanjini et al. (2012) proposed and established the convergence of an LSTD actor–critic method similar to Algorithm 1 for problems of minimizing expected average costs. In comparison, the goal of Problem 3.10 in this paper is to minimize an expected total cost (cf.
Section: Control Synthesis
confidence: 99%
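The contrast drawn in that statement is between two objectives. Written out in generic notation assumed here (not quoted from either paper), the cited method of Estanjini et al. (2012) targets a long-run average cost, whereas the citing paper's Problem 3.10 targets an expected total cost:

\[
\bar\alpha(\theta) = \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{k=0}^{N-1} c(x_k, u_k)\right]
\qquad \text{versus} \qquad
J(\theta) = \mathbb{E}\!\left[\sum_{k=0}^{T-1} c(x_k, u_k)\right],
\]
where $T$ is a (possibly random) termination time; the two criteria generally call for different critic formulations even when the same LSTD machinery is reused.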