Residual Conv-Deconv Grid Network for Semantic Segmentation

Fourure, Damien; Emonet, Rémi; Fromont, Élisa; Muselet, Damien; Trémeau, Alain; Wolf, Christian

doi:10.5244/c.31.181

Cited by 187 publications

(108 citation statements)

References 17 publications


“…Maintaining high-resolution representations. Our work is closely related to several works that can also generate high-resolution representations, e.g., convolutional neural fabrics [98], interlinked CNNs [150], GridNet [29], and multiscale DenseNet [43].…”

Section: Related Work (mentioning)

confidence: 98%

“…The two early works, convolutional neural fabrics [98] and interlinked CNNs [150], lack careful design on when to start low-resolution parallel streams, and how and where to exchange information across parallel streams, and do not use batch normalization and residual connections, thus not showing satisfactory performance. GridNet [29] is like a combination of multiple U-Nets and includes two symmetric information exchange stages: the first stage passes information only from high resolution to low resolution, and the second stage passes information only from low resolution to high resolution. This limits its segmentation quality.…”

Section: Related Work (mentioning)

confidence: 99%
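The two-stage information flow described in the statement above can be sketched as a small directed graph. This is only an illustration of the quoted description, not the GridNet implementation: the function names `grid_edges` and `reachable`, the `(stream, column)` node labels, and the choice to split the columns evenly between the two stages are all assumptions made for the example.

```python
def grid_edges(n_streams, n_cols):
    """Directed edges of a toy grid whose nodes are (stream, column) pairs.

    Stream 0 is the highest resolution. Assumption: the first half of the
    columns forms the high-to-low stage, the second half the low-to-high stage.
    """
    edges = []
    half = n_cols // 2
    for s in range(n_streams):
        for c in range(n_cols):
            # Horizontal edge: same resolution, one column to the right.
            if c + 1 < n_cols:
                edges.append(((s, c), (s, c + 1)))
            # First stage: information only moves to the lower resolution.
            if c < half and s + 1 < n_streams:
                edges.append(((s, c), (s + 1, c)))
            # Second stage: information only moves to the higher resolution.
            if c >= half and s > 0:
                edges.append(((s, c), (s - 1, c)))
    return edges


def reachable(edges, start):
    """Set of nodes that can be reached from `start` by following edges."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


edges = grid_edges(n_streams=3, n_cols=4)
# A high-resolution node in the second stage can no longer feed lower resolutions.
print(sorted(reachable(edges, (0, 2))))
```

In this toy graph a node in the second stage never feeds a lower-resolution stream, which is the one-directional exchange the citing authors point to.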


Deep High-Resolution Representation Learning for Visual Recognition

Wang, Sun, Cheng, et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.

2,582

1,479


High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet.

1 INTRODUCTION

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently-developed classification networks, including AlexNet [59], VGGNet [101], GoogleNet [108], ResNet [39], etc., follow the design rule of LeNet-5 [61]. This is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and lead to a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt the high-resolution recovery process to raise the representation resolution from the low-resolution representation outputted by a classification or classification-like network as depicted in Figure 1 (b), e.g., Hourglass [83], SegNet [3], DeconvNet [85], U-Net [95], SimpleBaseline [124], and encoder-decoder [90]. In addition, dilated convolutions are used to remove some down-sample layers and thus yield medium-resolution representations [15], [144].

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network

• J. Wang is with Microsoft Research,
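The two characteristics named in the abstract (parallel streams, repeated exchange across resolutions) can be illustrated on 1-D signals. This is a minimal sketch under loud assumptions, not the HRNet code: pair averaging stands in for a strided convolution, value repetition stands in for upsampling, fusion is a plain sum, and the helper names `downsample`, `upsample`, `resample` and `exchange` are hypothetical.

```python
def downsample(x):
    """Halve the length by averaging neighbouring pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x):
    """Double the length by repeating each value (stand-in for interpolation)."""
    return [v for v in x for _ in range(2)]


def resample(x, src, dst):
    """Bring a signal from stream `src` to the resolution of stream `dst`.

    Assumption: stream k holds a signal of length n / 2**k.
    """
    while src < dst:
        x, src = downsample(x), src + 1
    while src > dst:
        x, src = upsample(x), src - 1
    return x


def exchange(streams):
    """One exchange step: every stream receives the sum of all streams,
    each first resampled to the receiving stream's resolution."""
    out = []
    for dst in range(len(streams)):
        parts = [resample(x, src, dst) for src, x in enumerate(streams)]
        out.append([sum(vals) for vals in zip(*parts)])
    return out


streams = [[1, 2, 3, 4], [10, 20]]   # high- and low-resolution stream
print(exchange(streams))             # [[11, 12, 23, 24], [11.5, 23.5]]
```

Each stream keeps its own resolution after the step, so the exchange can be repeated any number of times while the high-resolution stream stays intact.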

“…The exploration of aggregating hierarchical feature has recently been the subject of research. Fourure et al [45] propose GridNet, which is an encoder-decoder architecture wherein the feature maps are wired in a grid fashion, generalizing several classical segmentation architectures. Despite GridNet contains multiple streams with different resolutions, it lacks up-sampling layers between skip connections; and thus, it does not represent UNet++.…”

Section: B Feature Aggregation (mentioning)

confidence: 99%

UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation

Zhou, Siddiquee, Tajbakhsh, et al. 2020

IEEE Trans. Med. Imaging

2,316

1,038


The state-of-the-art models for medical image segmentation are variants of U-Net and fully convolutional networks (FCN). Despite their success, these models have two limitations: (1) their optimal depth is a priori unknown, requiring extensive architecture search or inefficient ensemble of models of varying depths; and (2) their skip connections impose an unnecessarily restrictive fusion scheme, forcing aggregation only at the same-scale feature maps of the encoder and decoder sub-networks. To overcome these two limitations, we propose UNet++, a new neural architecture for semantic and instance segmentation, by (1) alleviating the unknown network depth with an efficient ensemble of U-Nets of varying depths, which partially share an encoder and co-learn simultaneously using deep supervision; (2) redesigning skip connections to aggregate features of varying semantic scales at the decoder sub-networks, leading to a highly flexible feature fusion scheme; and (3) devising a pruning scheme to accelerate the inference speed of UNet++. We have evaluated UNet++ using six different medical image segmentation datasets, covering multiple imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and electron microscopy (EM), and demonstrating that (1) UNet++ consistently outperforms the baseline models for the task of semantic segmentation across different datasets and backbone architectures; (2) UNet++ enhances segmentation quality of varying-size objects, an improvement over the fixed-depth U-Net; (3) Mask RCNN++ (Mask R-CNN with UNet++ design) outperforms the original Mask R-CNN for the task of instance segmentation; and (4) pruned UNet++ models achieve significant speedup while showing only modest performance degradation. Our implementation and pre-trained models are available at https://github.com/MrGiovanni/UNetPlusPlus.
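The redesigned skip connections mentioned in the abstract can be sketched as a wiring rule. The abstract itself does not spell the rule out, so what follows is an assumption based on our understanding of the nested UNet++ layout: a node at depth `i` and skip position `j` receives every earlier node at the same depth plus the upsampled node from one level below. The function name `node_inputs` and the `(i, j)` indexing are hypothetical labels for this illustration.

```python
def node_inputs(i, j):
    """List the inputs of node (i, j) in a nested-skip layout.

    i: depth level (0 = full resolution); j: position along the skip pathway.
    Each input is a (kind, node) pair, where kind is "down", "skip" or "up".
    """
    if j == 0:
        # Encoder column: fed by the node one level up, after downsampling.
        return [("down", (i - 1, 0))] if i > 0 else []
    # Dense same-scale skips from every earlier node on this level ...
    same_scale = [("skip", (i, k)) for k in range(j)]
    # ... plus the upsampled output of the node one level below.
    return same_scale + [("up", (i + 1, j - 1))]


for node in [(0, 0), (2, 0), (0, 1), (0, 2)]:
    print(node, node_inputs(*node))
```

Under this rule node `(0, 2)` fuses two same-scale feature maps with one coarser map, which is the kind of multi-scale aggregation a plain U-Net skip connection does not provide.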


“…3. Attention-based multi-scale estimation: Inspired by [7], we implement multi-scale estimation on a grid network. The grid network has clear advantages over the encoder-decoder network and the conventional multi-scale network extensively used in image restoration [18,41,38,27].…”

Section: Trainable Pre-processing Module (mentioning)

confidence: 99%

GridDehazeNet: Attention-Based Multi-Scale Network for Image Dehazing

Liu, Shi, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

632

393


We propose an end-to-end trainable Convolutional Neural Network (CNN), named GridDehazeNet, for single image dehazing. The GridDehazeNet consists of three modules: pre-processing, backbone, and post-processing. The trainable pre-processing module can generate learned inputs with better diversity and more pertinent features as compared to those derived inputs produced by hand-selected pre-processing methods. The backbone module implements a novel attention-based multi-scale estimation on a grid network, which can effectively alleviate the bottleneck issue often encountered in the conventional multi-scale approach. The post-processing module helps to reduce the artifacts in the final output. Experimental results indicate that the GridDehazeNet outperforms the state-of-the-arts on both synthetic and real-world images. The proposed dehazing method does not rely on the atmosphere scattering model, and we provide an explanation as to why it is not necessarily beneficial to take advantage of the dimension reduction offered by the atmosphere scattering model for image dehazing, even if only the dehazing results on synthetic images are concerned. Project website: https://proteus1991.github.io/GridDehazeNet/.
