2022
DOI: 10.48550/arxiv.2204.12026
Preprint

BATS: Best Action Trajectory Stitching

Abstract: The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts for developing algorithms in this area have revolved around introducing constraints to online reinforcement learning algorithms to ensure the actions of the learned policy are constrained to the logged data. In this work, we explore an alternative approach by planning on the fixed dataset directly. Specifically, we introduce an algorithm which forms a tabular Markov Decision Process…
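The abstract only sketches the approach, so here is a minimal toy sketch of what "planning on the fixed dataset directly" via a tabular MDP could look like: logged transitions become edges between logged states, nearby states are "stitched" with extra edges, and value iteration plans over the augmented graph. The toy dataset, the distance-threshold stitching rule, and all names below are illustrative assumptions, not the BATS implementation.

```python
# Minimal sketch: build a tabular MDP from logged transitions, add "stitch" edges
# between nearby states, and plan on the result with value iteration.
# Everything here (dataset, stitch_radius, zero-reward stitches) is an assumption
# made for illustration only.
import numpy as np

# Toy logged dataset: (s, a, r, s') tuples from two short trajectories.
logged = [
    (0.0, +1, 0.0, 1.0), (1.0, +1, 0.0, 2.0),   # trajectory A ends without reward
    (2.1, +1, 0.0, 3.0), (3.0, +1, 1.0, 4.0),   # trajectory B reaches a reward
]

# 1. The distinct logged states are the nodes of the tabular MDP.
states = sorted({s for s, _, _, _ in logged} | {sp for _, _, _, sp in logged})
idx = {s: i for i, s in enumerate(states)}

# 2. Logged transitions become edges (from-state, to-state, reward).
edges = [(idx[s], idx[sp], r) for s, _a, r, sp in logged]

# 3. "Stitch": connect states that are close enough that a short planned rollout
#    could plausibly bridge them (a distance threshold stands in for a learned model).
stitch_radius = 0.2
for s in states:
    for sp in states:
        if s != sp and abs(s - sp) <= stitch_radius:
            edges.append((idx[s], idx[sp], 0.0))  # assumed zero reward for the stitch

# 4. Plan on the augmented tabular MDP with value iteration.
gamma, V = 0.95, np.zeros(len(states))
for _ in range(100):
    Q = {i: [] for i in range(len(states))}
    for i, j, r in edges:
        Q[i].append(r + gamma * V[j])
    V = np.array([max(q) if q else 0.0 for q in Q.values()])

# States in trajectory A now inherit value through the stitched edge into trajectory B.
print({s: round(float(v), 2) for s, v in zip(states, V)})
```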

Cited by 1 publication (2 citation statements)
References: 7 publications
“…Taking a different approach, [19] utilizes a model-based data augmentation strategy, stitching together parts of historical demonstrations to create superior trajectories. Similarly, the Best Action Trajectory Stitching (BATS) [9] algorithm forms a tabular Markov Decision Process over logged data, adding new transitions using short planned trajectories. BATS not only aids in identifying advantageous trajectories but also provides theoretical bounds on the value function.…”
Section: Related Work (confidence: 99%)
“…DT utilizes a Transformer architecture to model and reproduce sequences from demonstrations, integrating a goal-conditioned policy to convert Offline RL into a supervised learning task. Despite its competitive performance in Offline RL tasks, the DT falls short in achieving trajectory stitching, a desirable property in Offline RL that refers to creating an optimal trajectory by combining parts of sub-optimal trajectories [19,9,57]. This limitation stems from the DT's inability to generate superior sequences, thus curbing its potential to learn optimal policies from sub-optimal trajectories (Figure 1).…”
Confidence: 99%
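Since trajectory stitching is the property the quoted statement turns on, the following is a toy illustration of the idea: two sub-optimal trajectories that pass through a common state are cut at that state and recombined into a trajectory with a higher return than either original. The states and rewards are invented for this example and do not come from any of the cited papers.

```python
# Toy illustration of trajectory stitching: combine the prefix of one sub-optimal
# trajectory with the suffix of another at a shared state. All values are made up.

# Each trajectory is a list of (state, reward) pairs.
traj_a = [("s0", 0), ("s1", 0), ("s2", 0), ("s3", 1)]    # fine start, weak ending
traj_b = [("s4", -3), ("s2", 0), ("s5", 0), ("s6", 5)]   # costly start, strong ending

def stitch(prefix_traj, suffix_traj, join_state):
    """Keep the prefix up to and including join_state, then the suffix after it."""
    cut_a = next(i for i, (s, _) in enumerate(prefix_traj) if s == join_state)
    cut_b = next(i for i, (s, _) in enumerate(suffix_traj) if s == join_state)
    return prefix_traj[:cut_a + 1] + suffix_traj[cut_b + 1:]

stitched = stitch(traj_a, traj_b, join_state="s2")
returns = {name: sum(r for _, r in t)
           for name, t in [("A", traj_a), ("B", traj_b), ("stitched", stitched)]}
print(stitched)   # [('s0', 0), ('s1', 0), ('s2', 0), ('s5', 0), ('s6', 5)]
print(returns)    # stitched return (5) beats both A (1) and B (2)
```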