Cross-Modality Fusion Transformer for Multispectral Object Detection

Fang, Qingyun; Han, Dapeng; Wang, Zhao-Kui

doi:10.2139/ssrn.4227745

Cited by 21 publications

(14 citation statements)

References 22 publications

“…MS-COCO is a large-scale visible light image dataset provided by Microsoft that contains more than 200,000 images and covers a variety of real-life scenarios [54]. MS FLIR is a multimodal image dataset provided by the Teledyne FLIR Company; it contains more than 8000 infrared-visible image pairs, which mainly focus on cars and driving scenes [55]. Specifically, we first observe that the VFIS model improves the RMSE of the UDHN model by 1.8% on the MS-COCO visible light image dataset.…”

Section: Overall Results (mentioning)

confidence: 99%

Cross-Modal Image Registration via Rasterized Parameter Prediction for Object Tracking

Zhang

Xiang

2023

Applied Sciences

Object tracking requires heterogeneous images that are well registered in advance, with cross-modal image registration used to transform images of the same scene generated by different sensors into the same coordinate system. Infrared and visible light sensors are the most widely used in environmental perception; however, misaligned pixel coordinates in cross-modal images remain a challenge in practical applications of the object tracking task. Traditional feature-based approaches can only be applied in single-mode scenarios, and cannot be well extended to cross-modal scenarios. Recent deep learning technology employs neural networks with large parameter scales for prediction of feature points for image registration. However, supervised learning methods require numerous manually aligned images for model training, leading to scalability and adaptivity problems. The Unsupervised Deep Homography Network (UDHN) applies Mean Absolute Error (MAE) metrics for cost function computation without labelled images; however, it is currently inapplicable for cross-modal image registration. In this paper, we propose aligning infrared and visible images using a rasterized parameter prediction algorithm with similarity measurement evaluation. Specifically, we use Cost Volume (CV) to predict registration parameters from coarse-grained to fine-grained layers with a raster constraint for multimodal feature fusion. In addition, motivated by the utilization of mutual information in contrastive learning, we apply a cross-modal similarity measurement algorithm for semi-supervised image registration. Our proposed method achieves state-of-the-art performance on the MS-COCO and FLIR datasets.

“…To build vectorized semantic HD map, HDMapNet [18] follows a segmentation-then-vectorization paradigm. To achieve end-to-end learning [7,40,11], VectorMapNet [25] adopts a coarse-to-fine two-stage pipeline and utilizes an auto-regressive decoder to predict points sequentially. MapTR [21] proposes unified permutation-equivalent modeling to exploit the undirected nature of semantic HD map and designs a parallel end-to-end framework.…”

Section: Related Work (mentioning)

confidence: 99%

Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction

Liao

Chen

Jiang

et al. 2023

Preprint

Online lane graph construction is a promising but challenging task in autonomous driving. Previous methods usually model the lane graph at the pixel or piece level, and recover the lane graph by pixel-wise or piece-wise connection, which breaks down the continuity of the lane. Human drivers focus on and drive along the continuous and complete paths instead of considering lane pieces. Autonomous vehicles also require path-specific guidance from lane graph for trajectory planning. We argue that the path, which indicates the traffic flow, is the primitive of the lane graph. Motivated by this, we propose to model the lane graph in a novel path-wise manner, which well preserves the continuity of the lane and encodes traffic information for planning. We present a path-based online lane graph construction method, termed LaneGAP, which end-to-end learns the path and recovers the lane graph via a Path2Graph algorithm. We qualitatively and quantitatively demonstrate the superiority of LaneGAP over conventional pixel-based and piece-based methods. Abundant visualizations show LaneGAP can cope with diverse traffic conditions. Code and models will be released for facilitating future research.

“…Besides, new technologies such as Transformers are being applied to the integration of multi-modal features. Fang et al. [50] proposed a model where the fusion module is placed in the backbone network layer for feature extraction. They designed a transformer-based fusion block, named Cross-Modality Fusion Transformer (CFT), to enhance the representation capability of two-stream CNNs in multi-spectral object detection.…”

Section: Multi-modal (mentioning)

confidence: 99%

Multi‐modal object detection via transformer network

Liu

Wang

Gao

et al. 2023

IET Image Processing

According to the fact that single‐modal data usually contain limited information, a great deal of effort has been devoted to making use of the complementary information contained in the multi‐modal data on various patterns. Thus, this paper is concerned with an object detection method that can fully utilize multi‐modal data. First, the method introduces the transformer mechanism to realize the fusion of intra‐modal and inter‐modal features of different modal data. The aim is to take advantage of the complementarity of data between modalities, which helps to improve the performance of multi‐modal object detection. Second, a contrastive loss suitable for contrastive learning is applied. This enables the authors to effectively utilize label information. Extensive experiments are conducted on multiple object detection datasets to demonstrate the effectiveness of our proposed method.
