2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00837
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Cited by 37 publications (26 citation statements). References 31 publications.
“…Additionally, 3DJCG and D3Net focus on the joint promotion of 3D dense captioning and 3D visual grounding; their reported models are therefore trained with data from both tasks. Among the methods listed under SCST, χ-Trans2Cap [38] combines MLE training with standard SCST in an additive manner, while Scan2Cap and D3Net [7] both adopt a reward that combines the CIDEr score and listener losses via a weighted sum. It is worth mentioning that our model adopts standard SCST, whose reward function is the CIDEr score.…”
Section: Comparison With Existing Methods
Mentioning confidence: 99%
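
The quoted passage contrasts training objectives. A minimal sketch of the additive MLE + SCST combination it attributes to χ-Trans2Cap, with the CIDEr score as reward, is shown below; the function names and the scst_weight parameter are illustrative assumptions, not the paper's actual interface.

    import torch
    import torch.nn.functional as F

    def mle_loss(logits, target_ids, pad_id=0):
        # Standard maximum-likelihood (cross-entropy) caption loss.
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id,
        )

    def scst_loss(sample_log_probs, sample_cider, greedy_cider):
        # Self-critical sequence training: the greedy decode serves as the
        # baseline, so only sampled captions that beat it get positive credit.
        advantage = (sample_cider - greedy_cider).detach()
        return -(advantage * sample_log_probs).mean()

    def combined_loss(logits, target_ids, sample_log_probs,
                      sample_cider, greedy_cider, scst_weight=1.0):
        # Additive combination of MLE and standard SCST, as described in
        # the quoted passage (the weighting scheme is an assumption).
        return mle_loss(logits, target_ids) + scst_weight * scst_loss(
            sample_log_probs, sample_cider, greedy_cider)

In standard SCST the reward (here, a per-caption CIDEr score) is not differentiated; it only scales the sampled captions' log-probabilities, which is why the advantage is detached.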
“…3DJCG [4] and D3Net [7] study the joint promotion of 3D dense captioning and 3D visual grounding. χ-Trans2Cap [38] introduces additional 2D priors that complement the 3D information for dense captioning via knowledge transfer. More recently, [42] shifts attention to contextual information to perceive non-object information.…”
Section: Related Work
Mentioning confidence: 99%
“…2D Semantics Assisted Training for 3D Visual Grounding [17] utilizes 2D image semantics during training to ease joint point-cloud-language representation learning and assist 3D visual grounding. X-Trans2Cap [18] addresses 3D dense captioning with a teacher-student framework that uses cross-modal fusion for more effective knowledge transfer. In these approaches, the teacher model exploits the additional modalities only during the training phase; its knowledge is then distilled into a student model whose inputs are the 3D modality alone. Inference therefore needs no additional inputs, the test-time setting matches the original one, and the range of applications is not restricted.…”
Section: Multi-Modal Knowledge Distillation
Mentioning confidence: 99%
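
A minimal sketch of the train-time-only cross-modal distillation pattern the quoted passage describes: a hypothetical module whose teacher branch fuses 2D and 3D features during training, while the student branch consumes 3D features only. The module structure, dimensions, and MSE distillation loss are assumptions for illustration, not the papers' exact designs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalKD(nn.Module):
        # Hypothetical illustration: teacher fuses 2D + 3D; student is 3D-only.
        def __init__(self, dim=256):
            super().__init__()
            self.teacher_fuse = nn.Linear(2 * dim, dim)  # takes 3D and 2D features
            self.student_proj = nn.Linear(dim, dim)      # takes 3D features only

        def forward(self, feat_3d, feat_2d=None):
            student_feat = self.student_proj(feat_3d)
            if self.training and feat_2d is not None:
                # The teacher branch is active only during training.
                teacher_feat = self.teacher_fuse(
                    torch.cat([feat_3d, feat_2d], dim=-1))
                # Feature-level distillation pulls the 3D-only student
                # toward the multi-modal teacher representation.
                kd_loss = F.mse_loss(student_feat, teacher_feat.detach())
                return student_feat, kd_loss
            # At inference the 2D modality is never required.
            return student_feat, None

At test time the module is called with feat_3d alone, so the deployed model keeps the original 3D-only input setting described above.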