TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection

Min, Kyle; Corso, Jason J.

doi:10.1109/iccv.2019.00248

Cited by 129 publications

(119 citation statements)

References 40 publications

Supporting

Mentioning

107

Contrasting

Order By: Relevance

“…Earlier attempts to predict saliency typically utilized handcrafted image features, such as color, intensity, and orientation contrast [39], motion contrast [40], and camera motion [41]. Later on, DNN-based semantic-level features were extensively investigated for both image content [42]- [48] and video sequences [49]- [55]. Among these features, image saliency prediction only exploits spatial information, while video saliency prediction often relies on spatial and temporal attributes jointly.…”

Section: A Saliency-based Video Preprocessingmentioning

confidence: 99%

Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies

Ding

Chen

et al. 2021

Proc. IEEE

View full text Add to dashboard Cite

Significant advances in video compression systems have been made in the past several decades to satisfy the near-exponential growth of Internet-scale video traffic. From the application perspective, we have identified three major functional blocks, including preprocessing, coding, and postprocessing, which have been continuously investigated to maximize the end-user quality of experience (QoE) under a limited bit rate budget. Recently, artificial intelligence (AI)-powered techniques have shown great potential to further increase the efficiency of the aforementioned functional blocks, both individually and jointly. In this article, we review recent technical advances in video compression systems extensively, with an emphasis on deep neural network (DNN)based approaches, and then present three comprehensive case studies. On preprocessing, we show a switchable texturebased video coding example that leverages DNN-based scene understanding to extract semantic areas for the improvement Manuscript

show abstract

Section: A Saliency-based Video Preprocessingmentioning

confidence: 99%

Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies

Ding

Chen

et al. 2021

Proc. IEEE

View full text Add to dashboard Cite

show abstract

“…The other is the ineffective structures in extracting joint spatio-temporal FIGURE 1. Cases that TASED-Net [16] fails to allocate saliency and predict shifts of saliency focus. The rows from top to bottom are the input, ground-truth, and saliency maps predicted by TASED-Net [16].…”

Section: Introductionmentioning

confidence: 99%

“…Deep models based on LSTM or its variants [7]- [15] are difficult to train for long-range temporal evolution due to the frame-by-frame data feeding scheme. Models based on 3D fully CNN (3DCNN) [16], [17] still may fail to allocate saliency or to predict shifts of saliency focus when there are multiple instances.…”

Section: Introductionmentioning

confidence: 99%

“…Taking TASED-Net [16] as an example, which is based on 3DCNN and achieves the state-of-the-art performance, it still may fail in both cases mentioned above, as shown in Fig. 1.…”

Section: Introductionmentioning

confidence: 99%

“…In the first frame, a woman running towards camera attracts most attention according to ground-truth. However, TASED-Net [16] allocates nearly equal saliency to the woman and a parked car. The other four frames show a shift of saliency focus from a standing veteran to a young man.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction

Shi

2021

IEEE Access

View full text Add to dashboard Cite

As human beings are living in an always changing environment, predicting saliency maps from dynamic visual stimulus is of importance for modeling human visual system. Compared with human behavior, recent models based on LSTM and 3DCNN are still not good enough due to the limitation in spatio-temporal feature representation. In this paper, a novel 3D convolutional encoder-decoder architecture is proposed for saliency prediction on dynamic scenes. The encoder consists of two subnetworks to extract both spatial and temporal features in parallel with intermediate fusion, respectively. The saliency map is produced in decoder by firstly enlarging features in spatial dimensions and then aggregating temporal information. Specially designed structures can transfer pooling indices from encoder to decoder, which helps the generation of location-aware saliency maps. The proposed network can be trained and inferred in an end-to-end manner. Experimental results on benchmark DHF1K show that the proposed model achieves the state-of-the-art performance on key metrics including both normalized scanpath saliency and Pearson's correlation coefficient.

show abstract

Unified Image and Video Saliency Modeling

Droste

Jiao

Noble

2020

Lecture Notes in Computer Science

117

View full text Add to dashboard Cite

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. On the one hand, image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing. For video saliency prediction on the other hand, rapid gains have been achieved on the recent DHF1K benchmark through network architectures that are optimized for this task. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We find that it is crucial to model the domain shift between image and video saliency data and between different video saliency datasets for effective joint modeling. We identify different sources of domain shift and address them through four novel domain adaptation techniques-Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN-in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train the entire network simultaneously with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, as well as the image saliency datasets SALICON and MIT300. With one set of parameters, our method achieves state-of-the-art performance on all video saliency datasets and is on par with the state-of-the-art for image saliency prediction, despite a 5 to 20-fold reduction in model size and the fastest runtime among all competing deep models. We provide retrospective analyses and ablation studies which demonstrate the importance of the domain shift modeling. The code is available at https://github.com/rdroste/unisal.

show abstract

TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection

Cited by 129 publications

References 40 publications

Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies

Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies

A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction

Unified Image and Video Saliency Modeling

Contact Info

Product

Resources

About