Distributional Reinforcement Learning and State-Marginal Matching

Modeling the full distribution of returns instead of its mean led to the development of distributional RL algorithms (Bellemare et al., 2017; Castro et al., 2018; Barth-Maron et al., 2018) such as Categorical Q-learning (Bellemare et al., 2017). While our work shares techniques such as discretization and binning, these works focus on optimizing a non-conditional, reward-maximizing RL policy; our problem definition is therefore closer to that of state-marginal matching algorithms (Hazan et al., 2019; Lee et al., 2020; Ghasemipour et al., 2020), or equivalently inverse RL algorithms (Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016; Fu et al., 2018; Ghasemipour et al., 2020), whose connections to feature-expectation matching have long been discussed (Abbeel & Ng, 2004). However, these are often exclusively online algorithms, even in their sample-efficient variants (Kostrikov et al., 2019), since density-ratio estimation with either a discriminative (Ghasemipour et al., 2020) or a generative (Lee et al., 2020) approach requires on-policy samples, with the rare exception of Kostrikov et al. (2020).
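To make the discretization-and-binning connection concrete, the core step of Categorical Q-learning (C51; Bellemare et al., 2017) is the projection of the Bellman-updated return distribution back onto a fixed support of atoms. The sketch below is a minimal NumPy illustration of that projection; the function name, default support range, and atom count are illustrative choices, not taken from the original text.

```python
import numpy as np

def categorical_projection(probs, rewards, dones, gamma,
                           v_min=0.0, v_max=2.0, n_atoms=3):
    """Project the Bellman-updated return distribution onto a fixed
    categorical support (the C51 projection step).

    probs:   (batch, n_atoms) next-state return distributions
    rewards: (batch,) immediate rewards
    dones:   (batch,) episode-termination flags (1.0 if terminal)
    """
    delta = (v_max - v_min) / (n_atoms - 1)
    support = np.linspace(v_min, v_max, n_atoms)  # atoms z_j

    # Bellman update of each atom, clipped to the support range.
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * support[None, :]
    tz = np.clip(tz, v_min, v_max)

    # Fractional index of each updated atom on the fixed support.
    b = (tz - v_min) / delta
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)

    # Split each atom's probability mass between its two neighbours.
    projected = np.zeros_like(probs)
    batch = np.arange(probs.shape[0])[:, None]
    np.add.at(projected, (batch, lo), probs * (hi - b))
    np.add.at(projected, (batch, hi), probs * (b - lo))
    # If b lands exactly on an atom, lo == hi and both weights above
    # are zero; assign the full mass to that atom instead.
    np.add.at(projected, (batch, lo), probs * (lo == hi))
    return projected
```

For example, with atoms {0, 1, 2}, a reward of 0.5 spreads the mass of an atom at 0 evenly between atoms 0 and 1, since the updated value 0.5 sits halfway between them.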