2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01075
Cross-Modal Self-Attention Network for Referring Image Segmentation

Abstract: We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range depend…

Cited by 340 publications (242 citation statements)
References 24 publications
“…At first, Transformers showed great performance in NLP tasks. Then, thanks to this excellent performance, Transformers were applied to computer vision tasks such as video processing [69], image super-resolution [15], object detection [13] and segmentation [70], and image classification [71].…”
Section: Methods Based On Transformers
confidence: 99%
“…The main idea of self-attention is to help convolutions capture long-range interactions across the whole image domain. A network equipped with a self-attention module can, at each position, relate fine details to relevant details in distant regions of the image [20][21][22].…”
Section: Self-Attention (SA) Module
confidence: 99%
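The long-range interaction described in the quoted statement can be illustrated with a minimal sketch of self-attention over a flattened feature map. This is an assumption-laden toy (shared query/key/value, no learned projections), not the paper's CMSA module:

```python
import numpy as np

def self_attention(x):
    """Toy self-attention: x is an (N, C) feature map flattened to
    N = H*W spatial positions. Learned projections are omitted, so
    queries, keys, and values are all x itself."""
    q, k, v = x, x, x
    # (N, N) affinity map: every position attends to every other position
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output position is a convex combination of ALL positions,
    # which is how long-range dependencies enter the representation.
    return weights @ v

feats = np.random.rand(16, 8)   # e.g. a 4x4 spatial map with 8 channels
out = self_attention(feats)
print(out.shape)                # (16, 8)
```

Because every row of the attention weights sums to 1, each output feature is a weighted average over all spatial positions, regardless of distance.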
“…Multi-scale context modeling has verified its effectiveness in boosting segmentation accuracy in semantic segmentation [11,12,13,14]. Recent works have also shown that the performance of RIS can be further improved by aggregating long-range context from concatenated visual and linguistic features [15] with self-attention [16], or by collecting multi-scale context from fused multi-modal features [10,17] with atrous spatial pyramid pooling (ASPP) [11,18]. However, the former incurs a high memory cost for computing the affinity map and may introduce redundant features that make it harder to distinguish the referred object.…”
Section: Introduction
confidence: 96%
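The memory cost mentioned in the last statement comes from the affinity map itself: attention over N = H×W positions materializes an N×N matrix, so memory grows quadratically with spatial resolution. A rough back-of-envelope helper (hypothetical, assuming float32 entries):

```python
def affinity_map_bytes(h, w, dtype_bytes=4):
    """Memory needed for a dense (N x N) self-attention affinity map
    over an h x w feature map, assuming dtype_bytes per entry."""
    n = h * w
    return n * n * dtype_bytes

# A 64x64 feature map already needs a 4096x4096 affinity map:
print(affinity_map_bytes(64, 64) / 2**20)    # 64.0 MiB
# Doubling the spatial resolution multiplies the cost by 16:
print(affinity_map_bytes(128, 128) / 2**20)  # 1024.0 MiB
```

This quadratic growth is why multi-scale pooling schemes such as ASPP, which avoid the dense pairwise map, are attractive on large inputs.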