2012
DOI: 10.1002/nav.21481
A least squares temporal difference actor–critic algorithm with applications to warehouse management

Abstract: This paper develops a new approximate dynamic programming algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor-critic type and uses a least squares temporal difference learning method. It operates on a sample-path of the system and optimizes the policy within a prespecified class parameterized by a parsimonious set of parameters. The method is applicable to a partially observable Markov decision process setting whe…
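The abstract describes an actor-critic scheme in which a critic estimates value-function coefficients by least squares temporal difference (LSTD) learning along a single sample path, while an actor adjusts a low-dimensional policy parameter. Below is a minimal, hypothetical sketch of such an update loop in Python; the toy MDP, feature maps, step sizes, and all function names (phi, psi, policy) are illustrative assumptions and are not taken from the paper.

# Minimal LSTD actor-critic sketch (illustrative only; not the authors' exact algorithm).
# Assumed setup: a made-up 2-state / 2-action average-cost MDP, a Boltzmann policy
# parameterized by theta, compatible critic features psi = grad_theta log(policy),
# and LSTD statistics (A, b) accumulated along a single simulated sample path.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
n = n_states * n_actions                       # dimensionality of theta

# Hypothetical transition kernel P[x, u, x'] and one-step cost c[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
c = np.array([[1.0, 0.5],
              [2.0, 0.1]])

def phi(x, u):
    """One-hot state-action feature vector."""
    f = np.zeros(n)
    f[x * n_actions + u] = 1.0
    return f

def policy(theta, x):
    """Boltzmann (softmax) action probabilities at state x."""
    prefs = np.array([theta @ phi(x, u) for u in range(n_actions)])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def psi(theta, x, u):
    """Compatible features: gradient of the log policy probability."""
    p = policy(theta, x)
    return phi(x, u) - sum(p[a] * phi(x, a) for a in range(n_actions))

theta = np.zeros(n)                            # actor (policy) parameters
A, b, z = np.zeros((n, n)), np.zeros(n), np.zeros(n)
avg_cost, lam, x = 0.0, 0.9, 0

for k in range(1, 20001):
    gamma_k, beta_k = 1.0 / k, 0.01 / k        # critic / actor step sizes
    u = rng.choice(n_actions, p=policy(theta, x))
    x_next = rng.choice(n_states, p=P[x, u])
    cost = c[x, u]
    avg_cost += gamma_k * (cost - avg_cost)    # running average-cost estimate

    # Critic (LSTD): update eligibility trace and sufficient statistics, then solve for r
    z = lam * z + psi(theta, x, u)
    u_next = rng.choice(n_actions, p=policy(theta, x_next))
    A += gamma_k * (np.outer(z, psi(theta, x, u) - psi(theta, x_next, u_next)) - A)
    b += gamma_k * (z * (cost - avg_cost) - b)
    r = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solve, robust to singular A

    # Actor: gradient step to reduce average cost, using the critic's Q estimate r @ psi
    theta -= beta_k * (r @ psi(theta, x, u)) * psi(theta, x, u)
    x = x_next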

Cited by 19 publications (8 citation statements); citing publications span 2013–2024. References 28 publications.

Citation statements
“…Hence, it cannot be guaranteed to obtain a global optimal solution. Convergence results [22,23] establish that it converges to a neighborhood of a stationary point of the expected average reward with probability one (w.p.1).…”
Section: Refining the Moth Control Policy
confidence: 98%
“…One approach to solve this problem is to use an actor-critic algorithm [21]. This paper uses a modified version of a Least-Squares Temporal Difference (LSTD) actor-critic algorithm developed in [22].…”
Section: Refining the Moth Control Policy
confidence: 99%
“…As a property of inner products, the linear approximation in (6) does not change the estimate of the gradient ∇ᾱ(θ) if the optimal coefficient r★ in (7) is used. Furthermore, the linear approximation reduces the complexity of learning from the space ℝ^{|X||U|} to the space ℝ^n, where n is the dimensionality of θ (Konda, 2002; Estanjini et al., 2012).…”
Section: Control Synthesis
confidence: 99%
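The statement above compresses an orthogonality argument that may be easier to see written out. The display below is a reconstruction under standard actor-critic notation assumed here, not quoted from the citing paper: η_θ is the stationary state-action distribution, ψ_θ = ∇ ln μ_θ(u|x) are the compatible features, and Q_θ is the (differential) state-action cost.

\[
\nabla \bar\alpha(\theta)
  = \sum_{x,u} \eta_\theta(x,u)\, Q_\theta(x,u)\, \psi_\theta(x,u)
  = \sum_{x,u} \eta_\theta(x,u)\, \bigl(r^{\star\top}\psi_\theta(x,u)\bigr)\, \psi_\theta(x,u),
\]
since the optimal $r^\star$ makes the residual $Q_\theta - r^{\star\top}\psi_\theta$ orthogonal to every component of $\psi_\theta$ under the inner product $\langle f, g\rangle_{\eta_\theta} = \sum_{x,u}\eta_\theta(x,u)\, f(x,u)\, g(x,u)$. Learning the coefficient $r \in \mathbb{R}^n$ therefore replaces learning $Q_\theta \in \mathbb{R}^{|X|\,|U|}$.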
“…Algorithm 1 learns the critic parameters using an LSTD method, which has been shown to be superior to other stochastic learning methods in terms of the convergence rate (Konda and Tsitsiklis, 2003; Boyan, 1999). Estanjini et al. (2012) proposed and established the convergence of an LSTD actor–critic method similar to Algorithm 1 for problems of minimizing expected average costs. In comparison, the goal of Problem 3.10 in this paper is to minimize an expected total cost (cf.
Section: Control Synthesis
confidence: 99%
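The contrast drawn in that statement is between two objectives. Written out in generic notation assumed here (not quoted from either paper), the cited method of Estanjini et al. (2012) targets a long-run average cost, whereas the citing paper's Problem 3.10 targets an expected total cost:

\[
\bar\alpha(\theta) = \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{k=0}^{N-1} c(x_k, u_k)\right]
\qquad \text{versus} \qquad
J(\theta) = \mathbb{E}\!\left[\sum_{k=0}^{T-1} c(x_k, u_k)\right],
\]
where $T$ is a (possibly random) termination time; the two criteria generally call for different critic formulations even when the same LSTD machinery is reused.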