2020
DOI: 10.1109/tip.2019.2936112

Video Saliency Prediction Using Spatiotemporal Residual Attentive Networks

Cited by 123 publications (74 citation statements)
References 69 publications
“…[6] used attention mechanisms to encode static saliency information and applied LSTMs to learn temporal saliency representations across consecutive video frames. [3] put forward a composite attention mechanism that learned multi-scale local attentions and global attention priors for enhancing spatio-temporal features. Fig.…”
Section: Discussion
confidence: 99%
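
A minimal sketch of the pattern described in that excerpt, assuming pre-extracted per-frame CNN features: a per-frame spatial attention map re-weights the static features, and an LSTM then aggregates the attended features across consecutive frames. The module and parameter names below are hypothetical and do not come from the cited models.

import torch
import torch.nn as nn

# Hypothetical illustration: spatial attention over per-frame features, followed by
# an LSTM that models a temporal saliency representation across frames.
class AttentiveTemporalSaliency(nn.Module):
    def __init__(self, feat_channels: int = 64, hidden_size: int = 128):
        super().__init__()
        # 1x1 convolution producing a single-channel spatial attention map per frame.
        self.attention = nn.Conv2d(feat_channels, 1, kernel_size=1)
        # LSTM over globally pooled, attention-weighted frame features.
        self.lstm = nn.LSTM(feat_channels, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, 1)

    def forward(self, feats):
        # feats: (batch, time, channels, height, width) per-frame CNN features.
        b, t, c, h, w = feats.shape
        flat = feats.reshape(b * t, c, h, w)
        attn = torch.sigmoid(self.attention(flat))       # (b*t, 1, h, w) attention maps
        pooled = (flat * attn).mean(dim=(2, 3))          # attended, globally pooled features
        seq, _ = self.lstm(pooled.reshape(b, t, c))      # temporal representation per frame
        return self.readout(seq)                         # (b, t, 1) per-frame readout

if __name__ == "__main__":
    model = AttentiveTemporalSaliency()
    frames = torch.randn(2, 8, 64, 28, 28)
    print(model(frames).shape)  # torch.Size([2, 8, 1])
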
“…The OM-CNN consists of two branches: one extracts spatial features of objects based on YOLO [39], the other extracts motion information with a variation of FlowNet [40]. Lai et al. [11] propose a residual attention network to predict video saliency. The network uses two tightly coupled streams to extract appearance and motion features, and then a lightweight ConvGRU, an alternative to ConvLSTM, to model long-term temporal dependence.…”
Section: B. Modern Dynamic Saliency Models
confidence: 99%
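
Since the excerpt highlights a lightweight ConvGRU as the recurrent unit for long-term temporal dependence, here is a minimal ConvGRU cell sketch, a hypothetical illustration rather than the code of [11]: GRU-style gating implemented with convolutions so that the hidden state remains a spatial feature map.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    # Convolutional GRU cell: update/reset gates and candidate state are 2-D convolutions.
    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.hidden_channels = hidden_channels
        # Joint convolution producing update and reset gates from [input, hidden].
        self.gates = nn.Conv2d(in_channels + hidden_channels, 2 * hidden_channels,
                               kernel_size, padding=padding)
        # Convolution producing the candidate hidden state from [input, reset * hidden].
        self.candidate = nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                                   kernel_size, padding=padding)

    def forward(self, x, h_prev=None):
        if h_prev is None:
            h_prev = x.new_zeros(x.size(0), self.hidden_channels, x.size(2), x.size(3))
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h_prev], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_tilde

# Usage: unroll the cell over a short clip of fused appearance/motion features.
if __name__ == "__main__":
    cell = ConvGRUCell(in_channels=64, hidden_channels=64)
    clip = torch.randn(2, 5, 64, 28, 28)  # (batch, time, channels, height, width)
    h = None
    for step in range(clip.size(1)):
        h = cell(clip[:, step], h)
    print(h.shape)  # torch.Size([2, 64, 28, 28])
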
“…Additionally, structure priors in images can be efficiently fused using multi‐modal inputs. For example, spatial ranking maps are supervised by panoptic segmentation labels to alleviate overlapping problems among various classes [42], high‐resolution feature maps are refined with inputs from multiple paths [49], two streams, given stacked optical flows and colour images respectively, model motions and appearances [50], and gated convolutional layers enforce boundary information merely processed in a shape stream [32]; images in axial, coronal, and sagittal views are independently segmented and fused into a single result with union operations [43].…”
Section: Related Work
confidence: 99%