Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

Jain, Ayush; Gkanatsios, Nikolaos; Mediratta, Ishita; Fragkiadaki, Katerina

doi:10.1007/978-3-031-20059-5_24

Cited by 20 publications

(21 citation statements)

References 38 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For example, Text-guided Graph Neural Network [17] conducts instance segmentation on the full scene to create candidate objects as input to a graph neural network [32]; InstanceRefer [39] selects instance candidates from the panoptic segmentation of point clouds; 3DVG-Transformer [40] uses outputs from an object proposal generation module to fully leverage contextual clues for cross-modal proposal disambiguation. The best performing work in this category, BUTD-DETR [20], uses box proposals from a pretrained detector and scene features from the full 3D scene to decode objects with a detection head. The Multi-View Transformer [18] separately models the scene by projecting the 3D scene to a multi-view space, to eliminate dependence on specific views and learn robust representations.…”

Section: Related Workmentioning

confidence: 99%

“…Modules in NS3D can be trained end-to-end with only the groundtruth referred objects as supervision; each can also be trained individually whenever additional labels are available. In this paper, we use a hybrid training objective similar to prior works [2,20]. Specifically, we use the groundtruth object category to compute a per-object classification loss L oce (applied to all prob c , where c is the category) and the groundtruth final target object to compute a per-expression loss L tce .…”

Section: Trainingmentioning

confidence: 99%

“…BUTD-DETR [20] 0.670 † 0.530 MVT [18] 0.645 0.584 3DVG-TRANSFORMER [40] 0.514 0.446 INSTANCEREFER [39] 0.480 0.454 TEXT-GUIDED-GNNS [17] 0.450 0.458…”

Section: D Referring Expression Comprehensionmentioning

confidence: 99%

“…We report experiments on data efficiency compared to four top-performing prior work on ReferIt3D, two objectcentric methods (SAT [37] and TransRefer [16]) and two methods that model the full 3D scene (BUTD-DETR [20] and MVT [18]). We test on 0.…”

Section: Data Efficiencymentioning

confidence: 99%

“…6,584 examples) in the train set, with the same full test set. We note that BUTD-DETR [20] uses pretrained object classification results on the full ScanNet dataset [13], while others do not. Hence in Table 2, we report NS3D's performance on both settings: pretrained object classification on the full ReferIt3D train set (NS3D + Full), and on the smaller train set only (NS3D).…”

Section: Data Efficiencymentioning

confidence: 99%

See 4 more Smart Citations

NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations

Hsu¹,

Mao²

2023

Preprint

View full text Add to dashboard Cite

Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, as well as ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), key in disambiguating objects in complex 3D scenes. Modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance on settings of data-efficiency and generalization, and demonstrate zero-shot transfer to an unseen 3D question-answering task.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Trainingmentioning

confidence: 99%

“…BUTD-DETR [20] 0.670 † 0.530 MVT [18] 0.645 0.584 3DVG-TRANSFORMER [40] 0.514 0.446 INSTANCEREFER [39] 0.480 0.454 TEXT-GUIDED-GNNS [17] 0.450 0.458…”

Section: D Referring Expression Comprehensionmentioning

confidence: 99%