2020
DOI: 10.1609/aaai.v34i07.6718
Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection

Abstract: Spatiotemporal information is essential for video salient object detection (VSOD) because object motion strongly attracts human attention. Previous VSOD methods usually use Long Short-Term Memory (LSTM) or 3D ConvNets (C3D), which can encode motion information only through step-by-step propagation in the temporal domain. Recently, the non-local mechanism was proposed to capture long-range dependencies directly. However, applying the non-local mechanism to VSOD is not straightforward, because i) it …
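The non-local mechanism the abstract refers to can be sketched in a few lines. This is a toy illustration of the core idea (every position attends to every other position in one step), not the paper's pyramid-constrained variant; the function names and the dot-product affinity are our assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def non_local_block(features):
    """Toy non-local aggregation over a flattened feature map.

    `features` is a list of N position vectors. Each output vector is a
    softmax-weighted sum of ALL input vectors, so long-range dependencies
    are captured in a single step rather than by step-by-step temporal
    propagation as in LSTM/C3D pipelines.
    """
    out = []
    for q in features:
        # Affinity of the query position with every position (dot product).
        weights = softmax([dot(q, k) for k in features])
        # Aggregate every position's features with those weights.
        agg = [sum(w * v[i] for w, v in zip(weights, features))
               for i in range(len(q))]
        out.append(agg)
    return out
```

For example, `non_local_block([[1.0, 0.0], [0.0, 1.0]])` returns two vectors in which each position keeps most of its own feature but mixes in the other position's, regardless of their distance in the original map.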

Cited by 126 publications (76 citation statements)
References 42 publications
“…Visual comparisons of the proposed method and the state-of-the-art algorithms. From left to right: the input image, the ground truth (GT), and the saliency maps produced by our proposed method, SCOM [48], SCNN [50], DLVSD [33], FGRN [14], MBNM [51], MST [49], PSCA [52], PDB [11], LSTI [53], RCR [12], and SSAV [39]. Our method consistently produces saliency maps closest to the ground truth.…”
Section: Visual Comparison
confidence: 91%
“…We compare our video saliency detection network with 14 other state-of-the-art models, including MDB [43], MST [49], STBP [32], SFLR [30], SCOM [48], SCNN [50], DLVS [33], FGRN [14], MBNM [51], PDBM [11], RCRNet [12], SSAV [39], PSCA [52], and LSTI [53]. For a fair comparison, we use the code provided by Fan et al. [39] to compute these metrics on our video saliency maps.…”
Section: Comparison With the State-of-the-Art Methods, 1) Quantitative
confidence: 99%
“…In addition, CBAM [55] captures feature information from spatial and channel attention simultaneously, which significantly improves the feature representation ability. Recently, the non-local neural network [56] has been widely used in salient object detection [58], image super-resolution [59], etc. Its main purpose is to enhance the features at the current position by aggregating contextual information from other positions, addressing the problem that the receptive field of a single convolutional layer cannot effectively cover correlated regions.…”
Section: Attention in CNNs
confidence: 99%
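The channel-attention idea mentioned in the statement above can be sketched minimally. This is a simplified gate in the spirit of CBAM's channel branch, not its actual design (CBAM feeds both max- and average-pooled descriptors through a shared MLP); the function name and the plain sigmoid-of-average gate are our assumptions:

```python
import math

def channel_attention(feature_map):
    """Toy channel-attention gate: rescale each channel by a sigmoid of
    its global average, so channels with strong average activation are
    emphasised and weak ones are suppressed.

    `feature_map` is a list of channels, each a flat list of spatial values.
    """
    gated = []
    for channel in feature_map:
        avg = sum(channel) / len(channel)          # global average pooling
        gate = 1.0 / (1.0 + math.exp(-avg))        # sigmoid gate in (0, 1)
        gated.append([gate * v for v in channel])  # rescale the channel
    return gated
```

A channel with strongly positive average activation is passed through almost unchanged (gate near 1), while the spatial pattern within each channel is preserved, since the same scalar gate multiplies every position.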
“…Among modern Convolutional Neural Networks (ConvNets/CNNs), many techniques, e.g., dynamic heads with attention [1], dual attention [2], and self-attention [3], have gained increasing attention due to their capability. Still, all suffer from accuracy issues.…”
Section: Introduction
confidence: 99%