2020
DOI: 10.1007/978-3-030-58452-8_25

ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes

citations
Cited by 75 publications
(155 citation statements)
references
References 50 publications
“…To avoid such pitfalls, algorithms and techniques need to be developed for processing 3D inputs such as RGB-D, meshes, and point clouds in conjunction with language. Some pioneering works have already begun in this direction (Achlioptas et al, 2020; Liu et al, 2021; Roh et al, 2021) and we anticipate the trend to shift more towards developing algorithms for understanding as well as the generation of 3D scenes (Briq et al, 2021), while utilizing language as a main or auxiliary modality.…”
Section: Future Directions (mentioning)
confidence: 99%
“…Existing works focus on using language to confine individual objects, e.g., detecting referred 3D objects [7] or distinguishing objects according to language phrases [2]. Recently, ScanRefer [6] and ReferIt3D [1] introduce a task of localizing objects within a 3D scene given the linguistic descriptions, namely 3D visual grounding. Following them, several works are proposed to improve the performance through instance segmentation [14,46], or Transformer [33,44,49].…”
Section: Scene Graph Normalization (mentioning)
confidence: 99%
“…ReferIt3D [1]: It is initially a model for the 3D visual grounding task. The network first extracts object features through PointNet++ [31].…”
Section: Vqa-3d (mentioning)
confidence: 99%