Localizing objects described by referring expressions in visual signals, also known as visual grounding, has long been a major motivation for robotics and embodied vision. So far, growing efforts have been devoted to visual grounding in images [17,36,13,40,24,29,33,5,41,11,42,10,9,12,19,47,18,35,38,39,20] and videos [46,45,43,37,30,31,44]. Suppose that a robot is asked to fetch 'the spoon on the table in the kitchen' following your command [14,23]; this would require a …

[Figure 1: We present a novel task of 3D visual grounding in single-view RGBD images given a referring expression, and propose a bottom-up neural approach to address it.]