2022
DOI: 10.2139/ssrn.4227745

Cross-Modality Fusion Transformer for Multispectral Object Detection

Cited by 21 publications (14 citation statements) · References 22 publications
“…MS-COCO is a large-scale visible light image dataset provided by Microsoft that contains more than 200,000 images and covers a variety of real-life scenarios [54]. MS FLIR is a multimodal image dataset provided by the Teledyne FLIR Company; it contains more than 8000 infrared-visible image pairs, which mainly focus on cars and driving scenes [55]. Specifically, we first observe that the VFIS model improves the RMSE of the UDHN model by 1.8% on the MS-COCO visible light image dataset.…”
Section: Overall Results (mentioning)
confidence: 99%
“…To build vectorized semantic HD map, HDMapNet [18] follows a segmentation-then-vectorization paradigm. To achieve end-to-end learning [7,40,11], VectorMapNet [25] adopts a coarse-to-fine two-stage pipeline and utilizes an auto-regressive decoder to predict points sequentially. MapTR [21] proposes unified permutation-equivalent modeling to exploit the undirected nature of semantic HD map and designs a parallel end-to-end framework.…”
Section: Related Work (mentioning)
confidence: 99%
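
As a side note on the mechanism the statement above attributes to VectorMapNet [25], the following is a minimal, hypothetical sketch of an auto-regressive point decoder: a transformer decoder attends to (assumed) encoded scene features and emits map points one at a time. The class name, shapes, and hyperparameters are illustrative assumptions, not the actual VectorMapNet implementation.

```python
import torch
import torch.nn as nn

class AutoregressivePointDecoder(nn.Module):
    """Hypothetical sketch of an auto-regressive map-point decoder.

    Names, shapes, and sizes are illustrative assumptions, not the
    actual VectorMapNet design.
    """

    def __init__(self, d_model: int = 256, num_layers: int = 2):
        super().__init__()
        self.point_embed = nn.Linear(2, d_model)   # embed (x, y) coordinates
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)          # regress the next point

    @torch.no_grad()
    def generate(self, memory: torch.Tensor, start: torch.Tensor, steps: int):
        # memory: (B, S, d_model) encoded scene features (e.g. BEV tokens);
        # start: (B, 1, 2) seed point; returns a (B, steps + 1, 2) polyline.
        points = start
        for _ in range(steps):
            tgt = self.point_embed(points)                         # (B, T, d)
            mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
            out = self.decoder(tgt, memory, tgt_mask=mask)
            next_pt = self.head(out[:, -1:, :])                    # (B, 1, 2)
            points = torch.cat([points, next_pt], dim=1)           # grow sequence
        return points
```
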
“…In addition, new techniques such as Transformers are being applied to the integration of multi-modal features. Fang et al. [50] proposed a model in which the fusion module is placed in the backbone network for feature extraction. They designed a transformer-based fusion block, named the Cross-Modality Fusion Transformer (CFT), to enhance the representation capability of two-stream CNNs in multi-spectral object detection.…”
Section: Multi-modal (mentioning)
confidence: 99%
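
As a rough illustration of the fusion mechanism described in the statement above, the sketch below implements a generic transformer-based fusion block for a two-stream (RGB + thermal) CNN: feature maps from both streams are flattened into token sequences, concatenated, and passed through self-attention so that intra- and inter-modality relations are modeled jointly. This is a minimal sketch assuming a PyTorch setting; the class name, layer layout, and hyperparameters are illustrative assumptions, not the exact CFT configuration from Fang et al. [50].

```python
import torch
import torch.nn as nn

class FusionTransformerBlock(nn.Module):
    """Generic transformer-based fusion block for two-stream features.

    Hypothetical sketch: layer layout and defaults are assumptions,
    not the authors' exact CFT design.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        assert channels % num_heads == 0, "embed dim must divide by heads"
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels),
            nn.GELU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # rgb, thermal: (B, C, H, W) feature maps from the two CNN streams.
        b, c, h, w = rgb.shape
        # Flatten each map to (B, H*W, C) tokens and concatenate modalities,
        # so self-attention mixes intra- and inter-modality information.
        tokens = torch.cat(
            [rgb.flatten(2).transpose(1, 2),
             thermal.flatten(2).transpose(1, 2)], dim=1)       # (B, 2HW, C)
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)
        tokens = tokens + attn_out                             # residual
        tokens = tokens + self.mlp(self.norm2(tokens))         # residual
        # Split back into the two streams and restore the spatial layout.
        rgb_out, thermal_out = tokens.split(h * w, dim=1)
        to_map = lambda t: t.transpose(1, 2).reshape(b, c, h, w)
        return to_map(rgb_out), to_map(thermal_out)


# Example: fuse stride-16 feature maps from both streams.
if __name__ == "__main__":
    block = FusionTransformerBlock(channels=256, num_heads=8)
    rgb_feat = torch.randn(2, 256, 20, 20)
    thermal_feat = torch.randn(2, 256, 20, 20)
    fused_rgb, fused_thermal = block(rgb_feat, thermal_feat)
    print(fused_rgb.shape, fused_thermal.shape)  # (2, 256, 20, 20) each
```

In a detector, a block like this would typically be inserted at one or more backbone stages, with the fused maps fed back into their respective streams so each modality benefits from the other's features.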