Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3547869
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Cited by 29 publications (11 citation statements)
References 24 publications
“…In weakly supervised learning, data labels are usually of low quality. In the AVVP task, video-level labels are used for training, and precise labels are used at test time (Tian et al. 2020; Wu and Yang 2021; Yu et al. 2021a).…”
Section: Data Annotations
Confidence: 99%
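To make the weak-supervision setup in this statement concrete, here is a minimal sketch of the AVVP training/testing asymmetry: only video-level labels drive the training loss, while the per-segment predictions are read off directly at test time. The tensor shapes, the linear classifier, and the max-pooling aggregation are illustrative assumptions, not the method of any one cited paper.

```python
# Sketch: weakly supervised video parsing. Training uses video-level
# labels only; evaluation thresholds the segment-level predictions.
import torch
import torch.nn as nn

class WeaklySupervisedParser(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 25):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, time, feat_dim) -> per-segment class probabilities
        return torch.sigmoid(self.classifier(segments))

model = WeaklySupervisedParser()
feats = torch.randn(2, 10, 512)             # 2 videos, 10 one-second segments
video_labels = torch.randint(0, 2, (2, 25)).float()

seg_probs = model(feats)                    # (2, 10, 25) segment-level scores
video_probs = seg_probs.max(dim=1).values   # MIL-style pooling to video level
loss = nn.functional.binary_cross_entropy(video_probs, video_labels)

# At test time the "precise" segment-level labels enter only for
# evaluation; predictions come straight from the segment scores.
predictions = (seg_probs > 0.5).float()
```

Max pooling is one common multiple-instance choice here; attentive pooling is another, and the cited works differ in exactly this aggregation step.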
“…Some methods focus on network design. Yu et al [45] propose a multimodal pyramid attentional network that consists of multiple pyramid units to encode the temporal features. Jiang et al [47] use two extra independent visual and audio prediction networks to alleviate the label interference between audio and visual modalities.…”
Section: Related Work
Confidence: 99%
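As a rough illustration of the pyramid-unit idea Yu et al. [45] are credited with, the sketch below encodes temporal features at several dilation scales and fuses them with multi-head attention. The dilation schedule, the mean fusion, and the residual connection are assumptions made for illustration; the actual MM-Pyramid design differs in its details.

```python
# Sketch of a "pyramid unit": multi-scale temporal convolutions fused
# with attention. Sizes and fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class PyramidUnit(nn.Module):
    def __init__(self, dim: int = 512, levels: int = 3, heads: int = 4):
        super().__init__()
        # One temporal conv per pyramid level, with a growing receptive field.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2**l, dilation=2**l)
            for l in range(levels)
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = x.transpose(1, 2)                         # (batch, dim, time)
        levels = [conv(h).transpose(1, 2) for conv in self.convs]
        pyramid = torch.stack(levels, dim=0).mean(0)  # fuse the scales
        out, _ = self.attn(pyramid, pyramid, pyramid) # attentional refinement
        return out + x                                # residual connection

feats = torch.randn(2, 10, 512)
print(PyramidUnit()(feats).shape)  # torch.Size([2, 10, 512])
```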
“…Existing methods usually adopt the objective function formulated in Eq. 1 for model training [42], [45], [48], [49], where y_v ∈ R^{1×C} is the video-level label obtained by label smoothing. Instead, we use the video-level pseudo label ŷ_v ∈ R^{1×C} generated by our PLG module as new supervision.…”
Section: B. Richness-Aware Loss
Confidence: 99%
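The supervision swap described in this statement can be sketched as follows: the same binary cross-entropy objective is driven first by a label-smoothed video-level label y_v (the Eq. 1-style baseline) and then by a pseudo label ŷ_v. How the PLG module produces ŷ_v is not reproduced here; the placeholder below only marks where it plugs in, and the smoothing formula is a common convention, not necessarily the cited papers' exact one.

```python
# Sketch: label-smoothed video-level BCE vs. pseudo-label supervision.
import torch
import torch.nn.functional as F

def video_loss(video_probs: torch.Tensor, y_v: torch.Tensor,
               smoothing: float = 0.1) -> torch.Tensor:
    # Eq. 1-style objective: BCE against the smoothed video-level label.
    y_smooth = y_v * (1 - smoothing) + 0.5 * smoothing
    return F.binary_cross_entropy(video_probs, y_smooth)

video_probs = torch.rand(2, 25)              # model's video-level predictions
y_v = torch.randint(0, 2, (2, 25)).float()   # video-level label, one row in R^{1×C}

baseline = video_loss(video_probs, y_v)

# The cited method instead supervises with a pseudo label from its PLG
# module; the random tensor below is a stand-in for that output.
y_hat_v = torch.rand(2, 25)                  # placeholder for the PLG output
richness_aware = F.binary_cross_entropy(video_probs, y_hat_v)
```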
“…multi-modal systems with audio-visual understanding ability. Various audio-visual tasks have been studied, including sound source localization [8, 19-21, 26-28], audio-visual event localization [32, 33, 35, 39], audio-visual video parsing [11, 18, 31], and audio-visual segmentation [37, 38]. In this work, we focus on unsupervised visual sound source localization, with the aim of localizing the sound-source objects in an image using its paired audio clip, but without relying on any manual annotations.…”
Section: Indicates Equal Contribution
Confidence: 99%