“…Many learnable methods focus on reducing the need for manual labor and 3D assets in computer graphics [Holden et al. 2017; Kuang et al. 2022; Liu et al. 2021; Starke et al. 2019, 2020], but each provides only a narrow video game function. More closely related to our work, neural video game simulation methods show that annotated videos can be used to learn interactive video generation [Davtyan and Favaro 2022; Kim et al. 2020; Menapace et al. 2021] and to build 3D environments in which agents are controlled through a set of discrete actions [Menapace et al. 2022]. Although these works bring us closer to learnable game engines, they have several limitations when applied to complex or real-world environments: they do not accurately model game logic, do not model the physical interactions of objects in 3D space, do not learn fine-grained controls, do not allow high-level, goal-driven control of the game flow, and, finally, do not model game AI.…”