“…Since then, RMs have been used for solving problems in planning (Illanes et al., 2019, 2020), robotics (DeFazio & Zhang, 2021; Camacho et al., 2020, 2021), multi-agent systems (Neary et al., 2021), lifelong RL (Zheng et al., 2021), and partial observability (Toro Icarte et al., 2019a). […] also considered both Mealy and Moore versions of RMs, though theirs only output numbers (like our simple RMs) instead of reward functions. Finally, there has been prominent work on how to learn RMs from experience (e.g., Toro Icarte et al., 2019a, 2019b; Xu et al., 2020a, 2020b; Furelos-Blanco et al., 2020a; Rens & Raskin, 2020; Hasanbeig et al., 2021; Velasquez et al., 2021). Since our previous work, we have gained practical experience and new theoretical insights about reward machines, which are reflected in this paper. In particular, we provide a cleaner definition of reward machines and QRM.…”