2017
DOI: 10.1007/978-3-319-50115-4_41

Deep Multispectral Semantic Scene Understanding of Forested Environments Using Multimodal Fusion

Cited by 125 publications (110 citation statements)
References 8 publications
“…Finally, we present extensive experimental evaluations of our proposed unimodal and multimodal architectures on benchmark scene understanding datasets including Cityscapes (Cordts et al, 2016), Synthia (Ros et al, 2016), SUN RGB-D (Song et al, 2015), ScanNet (Dai et al, 2017) and Freiburg Forest (Valada et al, 2016b). The results demonstrate that our model sets the new state-of-the-art on all these benchmarks considering the computational efficiency and the fast inference time of 72ms on a consumer grade GPU.…”
Section: Introduction
confidence: 91%
“…In the late fusion approach, identical network streams are first trained individually on a specific modality and the feature maps are fused towards the end of the network using concatenation (Eitel et al, 2015) or element-wise summation (Valada et al, 2016b), followed by learning deeper fused representations. However, this does not enable the network to adapt the fusion to changing scene context.…”
Section: Related Work
confidence: 99%
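The excerpt above contrasts two late-fusion variants: identical per-modality streams whose feature maps are merged by concatenation or element-wise summation, after which further layers learn a deeper fused representation. Below is a minimal PyTorch sketch of that general pattern, not the architecture of either cited paper; the layer sizes, channel counts, class count, and the LateFusionNet name are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LateFusionNet(nn.Module):
    """Two-stream late-fusion sketch: one encoder per modality, fused near the end."""

    def __init__(self, fusion="sum", in_ch_a=3, in_ch_b=3, num_classes=6):
        super().__init__()

        def make_stream(in_ch):
            # Identical shallow streams stand in for per-modality encoders that
            # would normally be trained individually on their own modality.
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            )

        self.stream_a = make_stream(in_ch_a)  # e.g. RGB
        self.stream_b = make_stream(in_ch_b)  # e.g. NIR or depth
        self.fusion = fusion
        fused_ch = 128 if fusion == "sum" else 256
        # Layers after the merge point learn the deeper fused representation.
        self.head = nn.Sequential(
            nn.Conv2d(fused_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, x_a, x_b):
        f_a = self.stream_a(x_a)
        f_b = self.stream_b(x_b)
        if self.fusion == "sum":
            fused = f_a + f_b                     # element-wise summation
        else:
            fused = torch.cat([f_a, f_b], dim=1)  # channel concatenation
        return self.head(fused)


# Example: fuse a 3-channel and a 1-channel modality by element-wise summation.
net = LateFusionNet(fusion="sum", in_ch_b=1)
logits = net(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 6, 64, 64])
```

Because the merge happens at a fixed point with fixed weights, this scheme cannot reweight the modalities per scene, which is the limitation the excerpt points out.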
“…According to several studies [27, 17], methods based on multiple encoders have a better capability to capture complementary and cross-modal interdependent features. Therefore, our proposed framework is based on a multi-encoder-based method.…”
Section: Multi-modal Fusion
confidence: 99%