2021
DOI: 10.48550/arxiv.2104.09224
Preprint

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

Abstract: How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting. However, for the actual driving task, the global context of the 3D scene is key, e.g. a change in traffic light state can affect the behavior of a vehicle geometrically distant from that traffic light. Geometry alone may therefore be insufficient for effectively fusing representations in end-to-end d…

Cited by 2 publications (2 citation statements). References 45 publications (83 reference statements).
“…Most state-of-the-art methods [23,34,41,19,25] for RGB-Depth semantic segmentation leverage an RGB encoder and a depth encoder respectively, and then apply middle fusion between them to better incorporate low-level (texture, geometry) and high-level (semantic) features. In the field of autonomous driving, there have also been great efforts on fusing camera frames and LiDAR scans for better object detection [29,28,37].…”
Section: Related Work
Confidence: 99%
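The middle-fusion pattern described in this citation statement can be sketched minimally. This is a hypothetical illustration, not the cited methods' actual architecture: two modality-specific encoders produce mid-level features that are concatenated before a single shared head.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    """Toy per-modality encoder: one linear layer + ReLU."""
    return np.maximum(x @ w, 0.0)

# Hypothetical shapes: inputs flattened to 64-dim vectors,
# each modality encoded to 32-dim mid-level features.
w_rgb = rng.normal(size=(64, 32))
w_depth = rng.normal(size=(64, 32))
w_head = rng.normal(size=(64, 10))  # shared head over the fused features

rgb = rng.normal(size=(4, 64))      # batch of 4 RGB feature vectors
depth = rng.normal(size=(4, 64))    # batch of 4 depth feature vectors

# Middle fusion: concatenate mid-level features from both encoders,
# then apply a single shared head (here producing class logits).
fused = np.concatenate([encoder(rgb, w_rgb), encoder(depth, w_depth)], axis=1)
logits = fused @ w_head
print(logits.shape)  # (4, 10)
```

The design choice illustrated is that fusion happens after each modality has extracted its own low-level features but before the task head, in contrast to early fusion (concatenating raw inputs) or late fusion (averaging per-modality predictions).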
“…To be specific, all the related operations in the Transformer network are order-independent and parallelizable. With the rapid development of the Transformer network, it has been widely applied and has demonstrated outstanding performance in many fields, such as image processing [13], pose recognition [14], autonomous driving [15], and natural language processing [16], [17]. Moreover, owing to the interaction and coupling effects among different components and subsystems of the gearbox, the measured vibration signals collected from sensors installed on the housing usually contain multiple intrinsic oscillatory modes [18].…”
Section: Introduction
Confidence: 99%
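The order-independence claim in this citation statement can be checked directly: softmax self-attention (shown here without learned projections, a simplifying assumption) is permutation-equivariant, so reordering the input tokens just reorders the output the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x
    (no learned projections, for illustration only)."""
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))   # 5 tokens, 8 dimensions each
perm = rng.permutation(5)

# Attending to permuted tokens equals permuting the attention output:
# the operation carries no notion of token order.
print(np.allclose(self_attention(x[perm]), self_attention(x)[perm]))  # True
```

This is why Transformers add explicit positional encodings when order matters, and why the attention computation itself parallelizes trivially across tokens.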