2022
DOI: 10.1109/jsen.2022.3208200
EMA-VIO: Deep Visual–Inertial Odometry With External Memory Attention

Cited by 13 publications (15 citation statements)
References 23 publications
“…Compared with VINS-Mono [23], the advantage of our proposed approach lies not only in the fusion stage but also in the front-end feature extraction described in Section II.A. Besides, the improvement over EMA-VIO [1], which also deploys a Transformer-based approach for fusion, possibly comes from the multi-layer fusion module aggregating the LiDAR and inertial data at different scales [31], [32].…”
Section: Positioning Results on KITTI Dataset
confidence: 99%
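
To make the multi-scale fusion idea in the statement above concrete, the following is a minimal PyTorch sketch that fuses an inertial feature vector with LiDAR feature maps at several pyramid scales. The module names, dimensions, and the simple concatenate-and-convolve fusion are illustrative assumptions, not the cited paper's actual implementation.

# Minimal sketch of multi-layer (multi-scale) LiDAR-inertial feature fusion
# (illustrative only; not the cited paper's implementation). The inertial
# feature is broadcast over each LiDAR feature map in a pyramid and fused
# by a 1x1 convolution at every scale.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, lidar_channels=(64, 128, 256), inertial_dim=128):
        super().__init__()
        self.fusers = nn.ModuleList(
            nn.Conv2d(c + inertial_dim, c, kernel_size=1) for c in lidar_channels
        )

    def forward(self, lidar_pyramid, inertial_feat):
        # lidar_pyramid: list of (B, C_k, H_k, W_k) maps at decreasing resolution
        # inertial_feat: (B, inertial_dim) vector from an IMU encoder
        fused = []
        for fmap, fuser in zip(lidar_pyramid, self.fusers):
            b, _, h, w = fmap.shape
            # Tile the inertial vector to match this scale's spatial size.
            imu = inertial_feat[:, :, None, None].expand(b, -1, h, w)
            fused.append(fuser(torch.cat([fmap, imu], dim=1)))
        return fused  # same shapes as the input pyramid, now inertial-aware

# Usage: a 3-level pyramid from a batch of 2 LiDAR range images.
pyramid = [torch.randn(2, 64, 64, 512),
           torch.randn(2, 128, 32, 256),
           torch.randn(2, 256, 16, 128)]
model = MultiScaleFusion()
out = model(pyramid, torch.randn(2, 128))
print([o.shape for o in out])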
“…Inspired by ViLT [28], the Transformer [15] architecture has shown impressive performance in multi-modal fusion, not only in odometry estimation but also in navigation [29], semantic segmentation, and object detection [30]. In EMA-VIO [1] and AFT-VO [2], the Transformer architecture is used to fuse multiple modalities and, in challenging real-world experiments, has shown higher accuracy and robustness than some soft-mask-based approaches. However, these works did not consider the effect of the fusion position, and the Transformer is used as a black box, without interpretability to explain how the two modalities interact and fuse inside the Transformer architecture.…”
Section: Learning-Based Multi-Modal Fusion for Odometry Estimation
confidence: 99%
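
As context for how such a Transformer-based fusion stage typically looks, here is a minimal PyTorch sketch: visual and inertial feature tokens are projected to a shared width, tagged with learned modality embeddings, and mixed by self-attention before pose regression. All names, dimensions, and the mean-pooled pose head are illustrative assumptions, not the actual EMA-VIO or AFT-VO implementation.

# Minimal sketch of Transformer-based visual-inertial fusion (illustrative
# only; not the EMA-VIO/AFT-VO code). Both modalities are projected to a
# shared width, concatenated into one token sequence, and mixed by
# self-attention so each token can attend across modalities.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, visual_dim=512, inertial_dim=256, d_model=256,
                 nhead=8, num_layers=2):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.inertial_proj = nn.Linear(inertial_dim, d_model)
        # Learned embeddings tell attention which modality a token came from.
        self.modality_embed = nn.Parameter(torch.zeros(2, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 6)  # 3-DoF translation + 3-DoF rotation

    def forward(self, visual_tokens, inertial_tokens):
        # visual_tokens: (B, Nv, visual_dim); inertial_tokens: (B, Ni, inertial_dim)
        v = self.visual_proj(visual_tokens) + self.modality_embed[0]
        i = self.inertial_proj(inertial_tokens) + self.modality_embed[1]
        fused = self.encoder(torch.cat([v, i], dim=1))  # (B, Nv+Ni, d_model)
        # Pool the fused sequence and regress a relative pose.
        return self.head(fused.mean(dim=1))

# Usage: fuse 4 visual tokens with 10 IMU tokens for a batch of 2.
model = TransformerFusion()
pose = model(torch.randn(2, 4, 512), torch.randn(2, 10, 256))
print(pose.shape)  # torch.Size([2, 6])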