2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros51168.2021.9635989
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Cited by 37 publications (19 citation statements); references 43 publications.
“…TASED [25] aggregates spatio-temporal features through the use of auxiliary pooling for reducing the temporal dimension. ViNet [16] integrates S3D features from multiple hierarchical levels by employing trilinear interpolation and 3D convolutions. UNISAL [6] proposes a multi-objective unified framework for both 2D and 3D saliency, with domain-specific modules and a lightweight recurrent architecture to handle temporal dynamics. While single-decoder approaches are common, multi-decoder output integration has recently attracted interest.…”
Section: Related Work
confidence: 99%
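The statement above describes ViNet integrating multi-level S3D features via trilinear interpolation. As an illustration only (not ViNet's actual implementation), a minimal NumPy sketch of trilinearly resizing a (T, H, W) feature volume, as one would do to align a coarse feature map with a finer one before fusing them with 3D convolutions:

```python
import numpy as np

def trilinear_resize(vol, out_shape):
    """Trilinearly resize a 3D volume of shape (T, H, W) to out_shape."""
    T, H, W = vol.shape
    # sample coordinates in the input volume for each output position
    t = np.linspace(0, T - 1, out_shape[0])
    h = np.linspace(0, H - 1, out_shape[1])
    w = np.linspace(0, W - 1, out_shape[2])
    # integer neighbours and fractional offsets along each axis
    t0 = np.floor(t).astype(int); t1 = np.minimum(t0 + 1, T - 1); ft = t - t0
    h0 = np.floor(h).astype(int); h1 = np.minimum(h0 + 1, H - 1); fh = h - h0
    w0 = np.floor(w).astype(int); w1 = np.minimum(w0 + 1, W - 1); fw = w - w0
    ft = ft[:, None, None]; fh = fh[None, :, None]; fw = fw[None, None, :]
    # gather the 8 corner volumes via outer (cross-product) indexing
    g = lambda ti, hi, wi: vol[np.ix_(ti, hi, wi)]
    c00 = g(t0, h0, w0) * (1 - fw) + g(t0, h0, w1) * fw
    c01 = g(t0, h1, w0) * (1 - fw) + g(t0, h1, w1) * fw
    c10 = g(t1, h0, w0) * (1 - fw) + g(t1, h0, w1) * fw
    c11 = g(t1, h1, w0) * (1 - fw) + g(t1, h1, w1) * fw
    c0 = c00 * (1 - fh) + c01 * fh
    c1 = c10 * (1 - fh) + c11 * fh
    return c0 * (1 - ft) + c1 * ft
```

In a real network this runs per channel (e.g. via `torch.nn.functional.interpolate` with `mode="trilinear"`); the sketch only shows the interpolation arithmetic.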
“…RecSal [30] predicts multiple saliency maps in a multi-objective training framework. Recent works introduce more [4,18,22,37,42]; U-Net-like architectures with feature sharing between encoder and decoder [6,16,19,25]; Deep Layer Aggregation [39]; hierarchical intermediate map aggregation [1,30,35].…”
Section: Related Work
confidence: 99%
“…For visual-audio saliency prediction, few DNN models have been proposed. Jain et al. (2020) proposed a 3D convolutional encoder-decoder architecture, named AViNet, to predict visual saliency. In AViNet, SoundNet (Aytar et al., 2016) is applied to extract audio features and S3D (Xie et al., 2018) to extract visual features, which are fused to output saliency maps of videos.…”
Section: Saliency Prediction
confidence: 99%
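The statement above describes fusing a clip-level audio embedding (SoundNet) with spatio-temporal visual features (S3D). A common fusion pattern, sketched here with illustrative shapes and random data (the feature dimensions and the concatenation scheme are assumptions for the example, not AViNet's exact design), is to broadcast the audio vector over every spatio-temporal location and concatenate along the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
visual = rng.standard_normal((512, 4, 7, 12))  # (C_v, T, H, W) S3D-like features
audio = rng.standard_normal(128)               # (C_a,) SoundNet-like clip embedding

# Tile the clip-level audio vector across all (T, H, W) positions,
# then concatenate with the visual features along the channel axis.
C_v, T, H, W = visual.shape
audio_map = np.broadcast_to(audio[:, None, None, None], (audio.size, T, H, W))
fused = np.concatenate([visual, audio_map], axis=0)  # (C_v + C_a, T, H, W)
```

A decoder (e.g. 3D convolutions) would then map the fused tensor to a saliency map; alternatives such as bilinear fusion replace the concatenation step.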