2014
DOI: 10.1007/978-3-319-11936-6_8

Verification of Markov Decision Processes Using Learning Algorithms

Abstract: We present a general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs). The primary goal of these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations. The first assumes that full knowledge of the MDP is available, and performs a heuristic-driven partial e…
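The first instantiation sketched in the abstract (heuristic-driven partial exploration with full knowledge of the MDP) can be pictured as bounded value iteration guided by simulation: maintain lower and upper bounds on the maximum reachability probability, sample paths greedily with respect to the upper bound, and back-propagate Bellman updates along each path. Below is a minimal sketch of that idea; the toy MDP, its states, actions, and probabilities are illustrative assumptions, not taken from the paper:

```python
import random

# Hypothetical toy MDP: states 0..3, goal state 3, sink state 2.
# transitions[s][a] = list of (successor, probability).
transitions = {
    0: {"a": [(1, 0.5), (2, 0.5)], "b": [(1, 0.8), (2, 0.2)]},
    1: {"a": [(3, 0.7), (0, 0.3)]},
}
GOAL, SINK = 3, 2

def brtdp_reach(transitions, init=0, eps=1e-4, episodes=10000):
    """Sketch of bounded, simulation-guided value iteration for maximum
    reachability: keep lower/upper bounds per state, sample paths toward
    the largest upper bound, and update bounds along each sampled path."""
    lo = {GOAL: 1.0, SINK: 0.0}
    hi = {GOAL: 1.0, SINK: 0.0}
    L = lambda s: lo.get(s, 0.0)   # unknown states: lower bound 0
    U = lambda s: hi.get(s, 1.0)   # unknown states: upper bound 1
    for _ in range(episodes):
        path, s = [], init
        while s not in (GOAL, SINK):
            path.append(s)
            # greedy action w.r.t. the upper bound (optimistic heuristic)
            act = max(transitions[s],
                      key=lambda a: sum(p * U(t) for t, p in transitions[s][a]))
            s = random.choices([t for t, _ in transitions[s][act]],
                               [p for _, p in transitions[s][act]])[0]
            if len(path) > 100:          # crude cycle guard for the sketch
                break
        for s in reversed(path):         # back-propagate Bellman updates
            lo[s] = max(sum(p * L(t) for t, p in a) for a in transitions[s].values())
            hi[s] = max(sum(p * U(t) for t, p in a) for a in transitions[s].values())
        if U(init) - L(init) < eps:      # bounds have converged at the start state
            break
    return L(init), U(init)

random.seed(1)
low, up = brtdp_reach(transitions)   # both bounds close to 14/19 for this toy MDP
```

On this example the optimal choice at state 0 is action "b", giving a maximum reachability probability of 14/19, and the algorithm closes the gap between the bounds without necessarily touching unreachable parts of a larger model.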


Cited by 152 publications (247 citation statements)
References 39 publications
“…Moreover, we gave results on the convergence speed, as well as criteria for obtaining exact convergence. As future work, it seems particularly interesting to test this algorithm on real instances, as is done in [2], where the authors moreover apply machine-learning techniques.…”
Section: Results (mentioning)
confidence: 99%
“…Interestingly, our approach was developed in parallel with Brázdil et al. [2], which solves a different problem with similar ideas. There, the authors use a machine-learning algorithm, namely real-time dynamic programming, to avoid applying the full operator at each step of value iteration, instead applying it partially based on a statistical test.…”
Section: Introduction (mentioning)
confidence: 99%
“…Figure 7 illustrates the transition probability map at a velocity of 25 km/h. In this study, according to the Markov decision processes (MDPs) introduced in [26], the driving schedule was considered a finite MDP. The MDP comprises a set of states S = {(SOC(t), n_eng(t)) | 0.2 ≤ SOC(t) ≤ 0.8, n_eng,min ≤ n_eng(t) ≤ n_eng,max}, a set of actions a = {u_th(t)}, a reward function r = f_m(s, a), and a transition function p_{s,a,s'}, where p_{s,a,s'} represents the probability of making a transition from state s to state s' using action a.…”
Section: Statistic Information of the Driving Schedule (mentioning)
confidence: 99%
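The tuple described in this excerpt (states, actions, reward function, and transition probabilities) is the standard finite-MDP interface. A minimal sketch of such a container, with a well-formedness check and one-step sampling; the concrete states, actions, rewards, and probabilities below are illustrative, not from the cited work:

```python
import random

# P[(s, a)] = {s_next: probability}, i.e. the transition function p_{s,a,s'}.
P = {
    ("low", "charge"): {"high": 0.9, "low": 0.1},
    ("low", "drive"):  {"low": 0.8, "high": 0.2},
    ("high", "drive"): {"high": 0.6, "low": 0.4},
}
# R[(s, a)] = immediate reward r(s, a).
R = {("low", "charge"): -1.0, ("low", "drive"): 0.5, ("high", "drive"): 1.0}

def check_stochastic(P, tol=1e-9):
    """Each (state, action) row of p_{s,a,s'} must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in P.values())

def step(s, a, rng=random):
    """Sample a successor s' according to p_{s,a,s'} and return (s', r(s, a))."""
    succ = P[(s, a)]
    s2 = rng.choices(list(succ), list(succ.values()))[0]
    return s2, R[(s, a)]
```

Checking row-stochasticity up front catches the most common modelling error (probabilities that do not sum to one) before any analysis or simulation is run.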
“…For the former, some SMC-like approaches have recently been developed. They work either by iteratively optimising the decisions of an explicitly stored scheduler [4,9], or by sampling from the scheduler space and iteratively improving a set of candidate near-optimal schedulers [5]. The former are heavyweight techniques, because the description of a (memoryless) scheduler is significant and, in the worst case, as large as the state space.…”
Section: Introduction (mentioning)
confidence: 99%