2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00837
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Cited by 37 publications (26 citation statements). References 31 publications.
“…Additionally, 3DJCG and D3Net focus on the joint promotion of 3D dense captioning and 3D visual grounding; their reported models are therefore trained with data from both tasks. Among the methods listed under SCST, χ-Trans2Cap [38] combines MLE training with standard SCST in an additive manner, while Scan2Cap and D3Net [7] both adopt a reward that combines the CIDEr score and listener losses via a weighted sum. It is worth mentioning that our model adopts standard SCST, whose reward function is the CIDEr score.…”
Section: Comparison With Existing Methods
Mentioning confidence: 99%
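
The quoted passage contrasts training objectives. A minimal sketch of the additive MLE + SCST combination it attributes to χ-Trans2Cap, with the CIDEr score as reward, is shown below; the function names and the scst_weight parameter are illustrative assumptions, not the paper's actual interface.

    import torch
    import torch.nn.functional as F

    def mle_loss(logits, target_ids, pad_id=0):
        # Standard maximum-likelihood (cross-entropy) caption loss.
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id,
        )

    def scst_loss(sample_log_probs, sample_cider, greedy_cider):
        # Self-critical sequence training: the greedy decode serves as the
        # baseline, so only sampled captions that beat it get positive credit.
        advantage = (sample_cider - greedy_cider).detach()
        return -(advantage * sample_log_probs).mean()

    def combined_loss(logits, target_ids, sample_log_probs,
                      sample_cider, greedy_cider, scst_weight=1.0):
        # Additive combination of MLE and standard SCST, as described in
        # the quoted passage (the weighting scheme is an assumption).
        return mle_loss(logits, target_ids) + scst_weight * scst_loss(
            sample_log_probs, sample_cider, greedy_cider)

In standard SCST the reward (here, a per-caption CIDEr score) is not differentiated; it only scales the sampled captions' log-probabilities, which is why the advantage is detached.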
“…3DJCG [4] and D3Net [7] study the joint promotion of 3D dense captioning and 3D visual grounding. χ-Trans2Cap [38] introduces additional 2D priors that complement the 3D information for dense captioning via knowledge transfer. More recently, [42] shifts attention to contextual information to perceive non-object information.…”
Section: Related Work
Mentioning confidence: 99%
“…2D Semantics Assisted Training for 3D Visual Grounding [17] utilizes 2D image semantics during training to ease joint point-cloud-language representation learning and assist 3D visual grounding. X-Trans2Cap [18] addresses 3D dense captioning with a teacher-student framework that uses cross-modal fusion for more effective knowledge transfer. In these approaches, the teacher model exploits the additional modalities only during the training phase; its knowledge is then distilled into a student model whose inputs are the 3D modality alone. Inference therefore needs no additional inputs, the test-time setting matches the original one, and the range of applications is not restricted.…”
Section: Multi-Modal Knowledge Distillation
Mentioning confidence: 99%
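
A minimal sketch of the train-time-only cross-modal distillation pattern the quoted passage describes: a hypothetical module whose teacher branch fuses 2D and 3D features during training, while the student branch consumes 3D features only. The module structure, dimensions, and MSE distillation loss are assumptions for illustration, not the papers' exact designs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalKD(nn.Module):
        # Hypothetical illustration: teacher fuses 2D + 3D; student is 3D-only.
        def __init__(self, dim=256):
            super().__init__()
            self.teacher_fuse = nn.Linear(2 * dim, dim)  # takes 3D and 2D features
            self.student_proj = nn.Linear(dim, dim)      # takes 3D features only

        def forward(self, feat_3d, feat_2d=None):
            student_feat = self.student_proj(feat_3d)
            if self.training and feat_2d is not None:
                # The teacher branch is active only during training.
                teacher_feat = self.teacher_fuse(
                    torch.cat([feat_3d, feat_2d], dim=-1))
                # Feature-level distillation pulls the 3D-only student
                # toward the multi-modal teacher representation.
                kd_loss = F.mse_loss(student_feat, teacher_feat.detach())
                return student_feat, kd_loss
            # At inference the 2D modality is never required.
            return student_feat, None

At test time the module is called with feat_3d alone, so the deployed model keeps the original 3D-only input setting described above.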