2012
DOI: 10.1007/978-3-642-27645-3_17

Reinforcement Learning in Games

Abstract: Reinforcement learning and games have a long and mutually beneficial common history. From one side, games are rich and challenging domains for testing reinforcement learning algorithms. From the other side, in several games the best computer players use reinforcement learning. The chapter begins with a selection of games and notable reinforcement learning implementations. Without any modifications, the basic reinforcement learning algorithms are rarely sufficient for high-level gameplay, so it is ess…



Cited by 36 publications (24 citation statements)
References 68 publications (67 reference statements)
“…The second is partially observable and extends the partially observable Markov decision process (POMDP) (Cassandra, 1998; Kaelbling et al., 1998). The single-objective versions of these models are widely used and applied in areas such as communication networks (Altman, 2002), planning and scheduling (Scharpff et al., 2013), games (Szita, 2012) and robotics (Kober and Peters, 2012). The multi-objective models have been gaining traction relatively recently.…”
Section: Sequential Decision-making
confidence: 99%
“…There is partial evidence for many factors being involved here; Szita [16] suggested that policy representation (relying on function approximation), the presence of randomness, environment observability, and training regime are, among others, the critical factors. In this study, we focus on policy representation and, more specifically, on its dimensionality, meant as the number of variables/parameters that characterize candidate policies.…”
Section: Introduction
confidence: 97%
“…the actions to take in the MDP states in order to maximise the obtained rewards (Wiering and Otterlo, 2012). Although successfully used in applications ranging from gaming (Szita, 2012) to robotics (Kober et al, 2013), standard RL is not applicable to problems where the policies synthesised by the agent must satisfy strict constraints associated with the safety, reliability, performance and other critical aspects of the problem.…”
Section: Introduction
confidence: 99%