On the Expressivity of Markov Reward (Extended Abstract)

Zhang, Xindi

doi:10.24963/ijcai.2022/730

Cited by 8 publications

(16 citation statements)

References 31 publications

“…For the two policies, the avg. costs are g(1) = R and g(2) = pS + (1 − p)R. Strangely, we must set R > S in order for g(2) < g(1).…”

Section: Motivating Examples (mentioning)

confidence: 99%
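
The inequality in the statement above can be checked in one line. This is a sketch that uses only the two average costs quoted there and assumes 0 < p ≤ 1; the range of p is an assumption, since the excerpt does not state it.

```latex
\begin{align*}
g(2) < g(1)
  &\iff pS + (1 - p)R < R \\
  &\iff pS < pR \\
  &\iff S < R \qquad (\text{dividing by } p > 0).
\end{align*}
```

So, under that assumption, the second policy has the lower average cost exactly when R > S, which matches the condition in the quoted statement.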

“…In other words, probability-optimal policies are those that satisfy the entirety of the task, both desired and required behaviors, whereas $V^{P}_{\pi,\lambda} \equiv (J_{\pi} + \lambda g_{\pi})P[\pi \models \varphi]$ is the normalized value function, corresponding to a notion of energy or effort required, with $\lambda$ representing the tradeoff between gain and transient cost. We will often omit the dependence of $V$ on $P$ and $\lambda$ for brevity.…”

Section: Problem Formulation (mentioning)

confidence: 99%

“…Hence, the SSP guarantee (our center term) and the standard RL guarantee are very similar. The first term $\frac{1}{\beta}$ does not appear in standard RL literature because there is no constraint verification needed, but in practice will be dominated by the other terms. The last term is also similar to the center term.…”

Section: B Analysis: Statements With Proof B1 Sample Complexity Guara... (mentioning)

confidence: 99%

“…, applying union bound over all $(s, a, s') \in S \times A \times S$, and observing that $Z_i \sim P(s, a, s')$ is a Bernoulli random variable with empirical variance $\hat{V}_n = \hat{P}(s, a, s')(1 - \hat{P}(s, a, s'))$ yields the result: $\{\forall (s, a, s') \in S \times A \times S,\ \forall n > 1 : |P(s, a, s') - \hat{P}(s, a, s')| \le \psi_{sas'}(n)\}$ holds with prob. $1 - \delta$. Observing that $\psi_{sas'}(n) \le \psi(n)$ for all $n > 1$ because $\psi_{sas'}(n)$ takes on a maximum when $\hat{P}(s, a, s') = \frac{1}{2}$, completes the proof. Lemma B.2 (Inverting E).…”

Section: B2 High Probability Event and Sample Requirement Definition ... (mentioning)

confidence: 99%

“…By combining these objectives into scalar costs, one erases the distinction between these two categories of tasks. Also, there is recent theoretical evidence that certain tasks are simply not reducible to scalar costs [1] (see Section 2). In practice, one circumvents these challenges using heuristics such as adding "breadcrumbs" [52].…”

Section: Introduction (mentioning)

confidence: 99%

Policy Optimization with Linear Temporal Logic Constraints

Voloshin, Le, Chaudhuri, et al. 2022

Preprint

We study the problem of policy optimization (PO) with linear temporal logic (LTL) constraints. The language of LTL allows flexible description of tasks that may be unnatural to encode as a scalar cost function. We consider LTL-constrained PO as a systematic framework that decouples task specification from policy selection and offers an alternative to the standard practice of cost shaping. With access to a generative model, we develop a model-based approach that enjoys a sample complexity analysis for guaranteeing both task satisfaction and cost optimality (through a reduction to a reachability problem). Empirically, our algorithm can achieve strong performance even in low sample regimes.

Specification-Guided Learning of Nash Equilibria with High Social Welfare

Jothimurugan, Bansal, Bastani, et al. 2022

Computer Aided Verification

Reinforcement learning has been shown to be an effective strategy for automatically training policies for challenging control problems. Focusing on non-cooperative multi-agent systems, we propose a novel reinforcement learning framework for training joint policies that form a Nash equilibrium. In our approach, rather than providing low-level reward functions, the user provides high-level specifications that encode the objective of each agent. Then, guided by the structure of the specifications, our algorithm searches over policies to identify one that provably forms an $\epsilon$-Nash equilibrium (with high probability). Importantly, it prioritizes policies in a way that maximizes social welfare across all agents. Our empirical evaluation demonstrates that our algorithm computes equilibrium policies with high social welfare, whereas state-of-the-art baselines either fail to compute Nash equilibria or compute ones with comparatively lower social welfare.

Beyond Markov Decision Process with Scalar Markovian Rewards

Miura 2022

SOCS

Real-world decision problems often involve multiple competing objectives or a complex reward structure that violates the Markov assumption. However, existing research on sequential decision making under uncertainty has primarily focused on Markov Decision Processes (MDPs) with scalar Markovian reward signals. My thesis considers settings where scalar Markovian rewards are not sufficient to produce desired behaviors. The first part of my thesis develops algorithms to optimize lexicographically ordered objectives. The second part considers autonomous agents that incorporate the perspective of their observer. As the perspective of the observer can depend on how the agents have behaved so far, rewards in this setting can depend on histories (non-Markovian). In the final part of my thesis, I hope to characterize when rewards beyond scalar Markovian signals are needed from the decision-theoretic perspective.
