DT utilizes a Transformer architecture to model and reproduce sequences from demonstrations, integrating a goal-conditioned policy that recasts Offline RL as a supervised learning task. Despite its competitive performance on Offline RL tasks, DT falls short of achieving trajectory stitching, a desirable property in Offline RL that refers to composing an optimal trajectory from parts of sub-optimal trajectories [9,19,57]. This limitation stems from DT's inability to generate sequences superior to those in its training data, which curbs its potential to learn optimal policies from sub-optimal trajectories (Figure 1).
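To make the supervised-learning conversion concrete, the sketch below shows a minimal return-conditioned Transformer in the style of DT: return-to-go, state, and action tokens are interleaved into one causal sequence, and the model is trained to regress the demonstrated actions. All module names, dimensions, and hyperparameters here (e.g. `ReturnConditionedTransformer`, `embed_dim=128`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of return-conditioned sequence modeling in the style of
# Decision Transformer. Sizes and names are illustrative, not the paper's.
import torch
import torch.nn as nn

class ReturnConditionedTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, embed_dim=128,
                 n_layers=3, n_heads=4, max_len=20):
        super().__init__()
        # Separate embeddings for return-to-go, state, and action tokens.
        self.embed_rtg = nn.Linear(1, embed_dim)
        self.embed_state = nn.Linear(state_dim, embed_dim)
        self.embed_action = nn.Linear(act_dim, embed_dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, 3 * max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(embed_dim, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...).
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states),
             self.embed_action(actions)], dim=2,
        ).reshape(B, 3 * T, -1) + self.pos_emb[:, : 3 * T]
        # Causal mask: each token attends only to earlier tokens, so the
        # state token at step t never sees the action a_t it must predict.
        mask = torch.triu(torch.full((3 * T, 3 * T), float("-inf")), diagonal=1)
        h = self.transformer(tokens, mask=mask)
        # Predict a_t from the state token at each timestep.
        return self.predict_action(h[:, 1::3])

# Supervised training step: regress demonstrated actions conditioned on
# return-to-go -- this is the conversion of Offline RL into supervised learning.
model = ReturnConditionedTransformer(state_dim=17, act_dim=6)
rtg = torch.randn(8, 20, 1)
states = torch.randn(8, 20, 17)
actions = torch.randn(8, 20, 6)
loss = nn.functional.mse_loss(model(rtg, states, actions), actions)
loss.backward()
```

Note that the loss only rewards reproducing actions that appear in the demonstrations; nothing in this objective encourages producing a sequence better than any single trajectory in the data, which is why, as described above, plain return-conditioned supervised learning does not perform trajectory stitching on its own.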