Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3547869
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Cited by 29 publications (11 citation statements)
References 24 publications
“…In weakly supervised learning, data labels are usually of low quality. In the AVVP task, video-level labels are used for training, and precise labels are used at test time (Tian et al. 2020; Wu and Yang 2021; Yu et al. 2021a).…”
Section: Data Annotations
Confidence: 99%
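To make the weak-supervision setup in this statement concrete, here is a minimal sketch of the AVVP training/testing asymmetry: only video-level labels drive the training loss, while the per-segment predictions are read off directly at test time. The tensor shapes, the linear classifier, and the max-pooling aggregation are illustrative assumptions, not the method of any one cited paper.

```python
# Sketch: weakly supervised video parsing. Training uses video-level
# labels only; evaluation thresholds the segment-level predictions.
import torch
import torch.nn as nn

class WeaklySupervisedParser(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 25):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, time, feat_dim) -> per-segment class probabilities
        return torch.sigmoid(self.classifier(segments))

model = WeaklySupervisedParser()
feats = torch.randn(2, 10, 512)             # 2 videos, 10 one-second segments
video_labels = torch.randint(0, 2, (2, 25)).float()

seg_probs = model(feats)                    # (2, 10, 25) segment-level scores
video_probs = seg_probs.max(dim=1).values   # MIL-style pooling to video level
loss = nn.functional.binary_cross_entropy(video_probs, video_labels)

# At test time the "precise" segment-level labels enter only for
# evaluation; predictions come straight from the segment scores.
predictions = (seg_probs > 0.5).float()
```

Max pooling is one common multiple-instance choice here; attentive pooling is another, and the cited works differ in exactly this aggregation step.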
“…Some methods focus on network design. Yu et al [45] propose a multimodal pyramid attentional network that consists of multiple pyramid units to encode the temporal features. Jiang et al [47] use two extra independent visual and audio prediction networks to alleviate the label interference between audio and visual modalities.…”
Section: Related Work
Confidence: 99%
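As a rough illustration of the pyramid-unit idea Yu et al. [45] are credited with, the sketch below encodes temporal features at several dilation scales and fuses them with multi-head attention. The dilation schedule, the mean fusion, and the residual connection are assumptions made for illustration; the actual MM-Pyramid design differs in its details.

```python
# Sketch of a "pyramid unit": multi-scale temporal convolutions fused
# with attention. Sizes and fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class PyramidUnit(nn.Module):
    def __init__(self, dim: int = 512, levels: int = 3, heads: int = 4):
        super().__init__()
        # One temporal conv per pyramid level, with a growing receptive field.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2**l, dilation=2**l)
            for l in range(levels)
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = x.transpose(1, 2)                         # (batch, dim, time)
        levels = [conv(h).transpose(1, 2) for conv in self.convs]
        pyramid = torch.stack(levels, dim=0).mean(0)  # fuse the scales
        out, _ = self.attn(pyramid, pyramid, pyramid) # attentional refinement
        return out + x                                # residual connection

feats = torch.randn(2, 10, 512)
print(PyramidUnit()(feats).shape)  # torch.Size([2, 10, 512])
```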
“…Existing methods usually adopt the objective function formulated in Eq. 1 for model training [42], [45], [48], [49], where y_v ∈ R^{1×C} is the video-level label obtained by label smoothing. Instead, we use the video-level pseudo label ŷ_v ∈ R^{1×C} generated by our PLG module as new supervision.…”
Section: B. Richness-Aware Loss
Confidence: 99%
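The supervision swap described in this statement can be sketched as follows: the same binary cross-entropy objective is driven first by a label-smoothed video-level label y_v (the Eq. 1-style baseline) and then by a pseudo label ŷ_v. How the PLG module produces ŷ_v is not reproduced here; the placeholder below only marks where it plugs in, and the smoothing formula is a common convention, not necessarily the cited papers' exact one.

```python
# Sketch: label-smoothed video-level BCE vs. pseudo-label supervision.
import torch
import torch.nn.functional as F

def video_loss(video_probs: torch.Tensor, y_v: torch.Tensor,
               smoothing: float = 0.1) -> torch.Tensor:
    # Eq. 1-style objective: BCE against the smoothed video-level label.
    y_smooth = y_v * (1 - smoothing) + 0.5 * smoothing
    return F.binary_cross_entropy(video_probs, y_smooth)

video_probs = torch.rand(2, 25)              # model's video-level predictions
y_v = torch.randint(0, 2, (2, 25)).float()   # video-level label, one row in R^{1×C}

baseline = video_loss(video_probs, y_v)

# The cited method instead supervises with a pseudo label from its PLG
# module; the random tensor below is a stand-in for that output.
y_hat_v = torch.rand(2, 25)                  # placeholder for the PLG output
richness_aware = F.binary_cross_entropy(video_probs, y_hat_v)
```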
“…multi-modal systems with audio-visual understanding ability. Various audio-visual tasks have been studied, including sound source localization [8, 19-21, 26-28], audio-visual event localization [32, 33, 35, 39], audio-visual video parsing [11, 18, 31], and audio-visual segmentation [37, 38]. In this work, we focus on unsupervised visual sound source localization, with the aim of localizing the sound-source objects in an image using its paired audio clip, but without relying on any manual annotations.…”
Section: Indicates Equal Contribution
Confidence: 99%