2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros51168.2021.9635989
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Cited by 37 publications (19 citation statements); references 43 publications.
“…TASED [25] aggregates spatio-temporal features through the use of auxiliary pooling for reducing the temporal dimension. ViNet [16] integrates S3D features from multiple hierarchical levels by employing trilinear interpolation and 3D convolutions. UNISAL [6] proposes a multi-objective unified framework for both 2D and 3D saliency, with domain-specific modules and a lightweight recurrent architecture to handle temporal dynamics. While single-decoder approaches are common, multi-decoder output integration has recently attracted interest.…”
Section: Related Work
confidence: 99%
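The statement above describes ViNet integrating multi-level S3D features via trilinear interpolation. As an illustration only (not ViNet's actual implementation), a minimal NumPy sketch of trilinearly resizing a (T, H, W) feature volume, as one would do to align a coarse feature map with a finer one before fusing them with 3D convolutions:

```python
import numpy as np

def trilinear_resize(vol, out_shape):
    """Trilinearly resize a 3D volume of shape (T, H, W) to out_shape."""
    T, H, W = vol.shape
    # sample coordinates in the input volume for each output position
    t = np.linspace(0, T - 1, out_shape[0])
    h = np.linspace(0, H - 1, out_shape[1])
    w = np.linspace(0, W - 1, out_shape[2])
    # integer neighbours and fractional offsets along each axis
    t0 = np.floor(t).astype(int); t1 = np.minimum(t0 + 1, T - 1); ft = t - t0
    h0 = np.floor(h).astype(int); h1 = np.minimum(h0 + 1, H - 1); fh = h - h0
    w0 = np.floor(w).astype(int); w1 = np.minimum(w0 + 1, W - 1); fw = w - w0
    ft = ft[:, None, None]; fh = fh[None, :, None]; fw = fw[None, None, :]
    # gather the 8 corner volumes via outer (cross-product) indexing
    g = lambda ti, hi, wi: vol[np.ix_(ti, hi, wi)]
    c00 = g(t0, h0, w0) * (1 - fw) + g(t0, h0, w1) * fw
    c01 = g(t0, h1, w0) * (1 - fw) + g(t0, h1, w1) * fw
    c10 = g(t1, h0, w0) * (1 - fw) + g(t1, h0, w1) * fw
    c11 = g(t1, h1, w0) * (1 - fw) + g(t1, h1, w1) * fw
    c0 = c00 * (1 - fh) + c01 * fh
    c1 = c10 * (1 - fh) + c11 * fh
    return c0 * (1 - ft) + c1 * ft
```

In a real network this runs per channel (e.g. via `torch.nn.functional.interpolate` with `mode="trilinear"`); the sketch only shows the interpolation arithmetic.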
“…RecSal [30] predicts multiple saliency maps in a multi-objective training framework. Recent works introduce more [4,18,22,37,42]; U-Net-like architectures with feature sharing between encoder and decoder [6,16,19,25]; Deep Layer Aggregation [39]; hierarchical intermediate map aggregation [1,30,35].…”
Section: Related Work
confidence: 99%
“…For visual-audio saliency prediction, few DNN models have been proposed. Jain et al. (2020) proposed a 3D convolutional encoder-decoder architecture, named AViNet, to predict visual saliency. In AViNet, SoundNet (Aytar et al., 2016) is applied to extract audio features and S3D (Xie et al., 2018) to extract visual features, which are fused to output saliency maps of videos.…”
Section: Saliency Prediction
confidence: 99%
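The statement above describes fusing a clip-level audio embedding (SoundNet) with spatio-temporal visual features (S3D). A common fusion pattern, sketched here with illustrative shapes and random data (the feature dimensions and the concatenation scheme are assumptions for the example, not AViNet's exact design), is to broadcast the audio vector over every spatio-temporal location and concatenate along the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
visual = rng.standard_normal((512, 4, 7, 12))  # (C_v, T, H, W) S3D-like features
audio = rng.standard_normal(128)               # (C_a,) SoundNet-like clip embedding

# Tile the clip-level audio vector across all (T, H, W) positions,
# then concatenate with the visual features along the channel axis.
C_v, T, H, W = visual.shape
audio_map = np.broadcast_to(audio[:, None, None, None], (audio.size, T, H, W))
fused = np.concatenate([visual, audio_map], axis=0)  # (C_v + C_a, T, H, W)
```

A decoder (e.g. 3D convolutions) would then map the fused tensor to a saliency map; alternatives such as bilinear fusion replace the concatenation step.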