1965
DOI: 10.21236/ad0619764
A Modified Dynamic Programming Method for Markovian Decision Problems

Cited by 11 publications (16 citation statements); references 3 publications (5 reference statements).
“…Unfortunately, although most likely neither the true optimal percentile policy nor the approximate one put positive weight on these state-action pairs, Theorem 4 states that our confidence in the approximate policy should depend on this reduced number of transition observations from the given pairs. We apply the idea of action elimination, proposed by MacQueen (1966) in the context of the nominal MDP, to the percentile optimization framework to relax this dependence.…”
Section: Improving the Bound With Action Elimination (mentioning)
confidence: 99%
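The action-elimination idea cited above can be made concrete with a short sketch. The following is not the construction used in the citing paper's percentile-optimization setting; it is a minimal, hypothetical illustration of MacQueen-style elimination inside ordinary discounted value iteration, where an action is discarded once an upper bound on its optimal Q-value falls below a lower bound on the optimal value of the state. All names (`P`, `R`, `gamma`, `n_iter`) are illustrative, not taken from the original report.

```python
import numpy as np

def macqueen_action_elimination(P, R, gamma, n_iter=200):
    """Value iteration with MacQueen-style action elimination (illustrative sketch).

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    gamma: discount factor in (0, 1).
    Returns the final value estimate and the surviving action set per state.
    """
    S, A = R.shape
    V = np.zeros(S)
    active = [set(range(A)) for _ in range(S)]   # actions not yet proven suboptimal

    for _ in range(n_iter):
        Q = R + gamma * (P @ V)          # Q(s, a) = R(s, a) + gamma * sum_j P(s, a, j) V(j)
        V_new = Q.max(axis=1)
        diff = V_new - V
        c_lo = gamma * diff.min() / (1.0 - gamma)
        c_hi = gamma * diff.max() / (1.0 - gamma)
        # MacQueen bounds: V_new + c_lo <= V* <= V_new + c_hi (componentwise).
        Q_up = R + gamma * (P @ V_new)   # one extra backup, used to upper-bound Q*
        for s in range(S):
            for a in list(active[s]):
                # Discard a if even an optimistic estimate of Q*(s, a) cannot reach
                # a pessimistic estimate of V*(s); such an action is never optimal.
                if Q_up[s, a] + gamma * c_hi < V_new[s] + c_lo:
                    active[s].discard(a)
        V = V_new
    return V, active
```

In a full implementation the eliminated actions would also be skipped in subsequent backups, which is where the computational saving comes from; the sketch keeps the backup over all actions for brevity.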
“…In Section VI, the performance of our approximation scheme and Gauss-Seidel VI is compared numerically. The RVI algorithm, proposed by [4] for average reward problems and generalized to discounted reward settings by [5] and [6], is shown to have (αβ_RVI)^k convergence in [7]. The convergence is proved in terms of the relative value function, which is the difference between the value of each state and the value of a fixed pre-selected state, and β_RVI < 1 is the second largest eigenvalue of the transition probability matrix corresponding to the optimal policy.…”
Section: A Related Literature (mentioning)
confidence: 99%
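The relative value function described in this excerpt admits a compact implementation. Below is a hypothetical minimal sketch of relative value iteration for a discounted MDP, in which the value of a fixed pre-selected reference state is subtracted after every backup; the variable names and the stopping rule are illustrative and not taken from references [4]-[7].

```python
import numpy as np

def relative_value_iteration(P, R, gamma, ref_state=0, tol=1e-8, max_iter=10_000):
    """Relative value iteration for a discounted MDP (illustrative sketch).

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    gamma: discount factor, ref_state: the fixed pre-selected reference state.
    Iterates on h = V - V(ref_state), the relative value function.
    """
    S, A = R.shape
    h = np.zeros(S)
    for _ in range(max_iter):
        Q = R + gamma * (P @ h)      # Bellman backup applied to the relative values
        V = Q.max(axis=1)
        h_new = V - V[ref_state]     # re-center so the reference state has value 0
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    # A greedy policy w.r.t. h coincides with one w.r.t. V*, since the two differ
    # only by a constant shift, which does not affect the argmax.
    policy = np.argmax(R + gamma * (P @ h), axis=1)
    return h, policy
```

The fixed point of this recentered iteration is V* minus the optimal value of the reference state, which is exactly the relative value function whose convergence rate the excerpt discusses.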
“…Such an implementation would need to use the entire node space or a layered-network approach. (We also attempted to use action elimination (MacQueen 1966), a technique that requires both an upper and a lower bound; however, its performance was not significantly better than dynamic programming. We used upper and lower bounds that were multiples of Euclidean distance.…”
Section: Bander and White Heuristic Search For Stochastic Shortest Paths (mentioning)
confidence: 99%
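To illustrate the kind of bound-based elimination test this last excerpt refers to, here is a small hypothetical sketch: per-successor lower and upper bounds on cost-to-go taken as multiples of Euclidean distance to the goal, with an action discarded when its optimistic expected cost already exceeds the best pessimistic expected cost at that state. The scaling constants and helper names are illustrative assumptions, not taken from Bander and White.

```python
import math

def euclidean_bounds(coord, goal, c_lo=1.0, c_hi=2.5):
    """Lower/upper bounds on expected cost-to-go as multiples of Euclidean distance."""
    d = math.dist(coord, goal)
    return c_lo * d, c_hi * d

def eliminate_actions(state, actions, coords, goal, step_cost, transitions):
    """Drop actions whose optimistic cost already exceeds the best pessimistic cost.

    transitions(state, a) -> iterable of (prob, next_state); coords maps states to (x, y).
    """
    def expected_bound(a, which):
        idx = 0 if which == "lo" else 1
        return step_cost(state, a) + sum(
            p * euclidean_bounds(coords[s2], goal)[idx]
            for p, s2 in transitions(state, a)
        )

    # Smallest upper-bound cost achievable at this state.
    best_upper = min(expected_bound(a, "hi") for a in actions)
    # Keep an action only if its lower-bound cost does not already exceed that value.
    return [a for a in actions if expected_bound(a, "lo") <= best_upper]
```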