“…Traditional studies have focused on techniques for detecting 3D objects [GD07, GSEH11] and predicting affordances based on human poses [DFL*12, GGVG11, FDG*12]. More recent works have emphasized generating realistic static poses within the context of a 3D scene [LLK*19, ZBT21, ZZM*20, HGT*21, ZWZ*22, HWL*23, GDG*23], leveraging newly collected datasets on human interactions [HCTB19, GMSPM21, BXP*22, TGBT20]. In contrast to these studies on static poses, our work tackles the task of generating coherent motions that align with the scene.…”