“…There is a rich literature on interactive agents and on grounding their behaviors in language [9,10,11,12]. Many prior works have studied this problem in the context of instruction following, where an agent aims to complete a task specified by formal language or programs [13,14,15,16,17,18] or by natural language [10,11,19,20]. These approaches have largely been studied in simulated spatial games [19,21,22,23] or in object-directed visual navigation with simulated robots [24,25,26,27,28,29,23], some of which include high-level object interaction [30]. In this work, by contrast, we focus on the domain of learning control for vision-based robotic manipulation.…”