In goal-based RL, where future states can inform "optimal" reward parameters with respect to the transitions' actions, hindsight methods have been applied successfully to enable effective training of goal-based Q-functions under sparse rewards, to derive exact connections between Q-learning and classic model-based RL, and to enable data-efficient off-policy hierarchical RL (Nachum et al., 2018), multi-task RL (Li et al., 2020), offline RL (Chebotar et al., 2021), and more (Choi et al., 2021; Ren et al., 2019; Zhao & Tresp, 2018; Ghosh et al., 2021; Nasiriany et al., 2021). Additionally, Lynch et al. (2019) and Gupta et al. (2018) have shown that BC is often sufficient for learning generalizable parameterized policies, thanks to the rich positive examples provided by future states; most recently, Chen et al. (2021a) and Janner et al. (2021) combined this idea with powerful transformer architectures (Vaswani et al., 2017) to produce state-of-the-art offline RL and goal-based RL results. Lastly, while motivated by alternative mathematical principles rather than parameterized objectives, future-state information has also been explored as a way of reducing variance or improving estimation in generic policy gradient methods (Pinto et al., 2017; Guo et al., 2021; Venuto et al., 2021).
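To make the core hindsight idea concrete, the sketch below relabels each transition with goals drawn from states reached later in the same trajectory and recomputes the sparse reward accordingly (the "future" relabeling strategy popularized by hindsight experience replay). This is a minimal illustration, not the method of any specific work cited above: the transition layout, `reward_fn`, and the distance tolerance are all illustrative assumptions.

```python
import random
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn, k=4):
    """HER-style "future" relabeling (illustrative sketch).

    For each transition, sample up to k goals from states visited later
    in the same trajectory and recompute the reward as if those goals
    had been intended all along, turning failures into positive examples.

    trajectory: list of (state, action, next_state, goal) tuples
    reward_fn:  maps (next_state, goal) -> scalar reward
    """
    relabeled = []
    for t, (s, a, s_next, g) in enumerate(trajectory):
        # Keep the original transition with its original goal.
        relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
        # Sample "achieved" goals from the future of this trajectory.
        future_states = [step[2] for step in trajectory[t:]]
        for g_new in random.sample(future_states, min(k, len(future_states))):
            relabeled.append((s, a, s_next, g_new, reward_fn(s_next, g_new)))
    return relabeled

def sparse_reward(next_state, goal, tol=1e-3):
    # Assumed sparse reward: 1 if the next state reaches the goal, else 0.
    return float(np.linalg.norm(np.asarray(next_state) - np.asarray(goal)) <= tol)
```

Under sparse rewards, almost every original transition carries zero reward; relabeling with achieved future states injects the dense positive examples that make goal-conditioned Q-learning and BC-style training effective.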