Positive Sample Propagation along the Audio-Visual Event Line

Zhou, Jinxing; Zheng, Liang; Zhong, Yiran; Hao, Shijie; Wang, Meng

doi:10.1109/cvpr46437.2021.00833

Cited by 58 publications

(21 citation statements)

References 20 publications


“…Hanyu X et al. [6] proposed to learn inter- and intra-modality information between the visual and audio modalities by adaptive attention and a self-attention model. Jinxing Z et al. [3] aggregate relevant information that may not be available at the same time through the positive sample distribution model. And they all use the auditory guided visual attention module that we will discuss below.…”

Section: Related Work (mentioning)

confidence: 99%

“…The sound source separation schemes proposed in [18], [19] show that the voices of different speakers can be distinguished by paying attention to the location of the spatial region around the speaker's voice and finding matching sound source information. In the audio-visual event localization task, [12] firstly adopted the auditory guided visual attention mechanism, [3], [6], [9] have followed this attention mechanism.…”

Section: Related Work (mentioning)

confidence: 99%
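
The audio-guided visual attention mentioned in the statement above can be illustrated with a minimal sketch. It assumes a plain scaled dot-product score between one audio feature vector and a set of visual region features; the cited works use learned projections, which are omitted here, and the function name and shapes are hypothetical.

```python
import numpy as np

def audio_guided_visual_attention(audio, visual_regions):
    """Illustrative only: weight visual regions by similarity to the audio.

    audio:          (d,)   audio feature of one segment (assumed shape)
    visual_regions: (R, d) features of R spatial regions (assumed shape)
    """
    d = audio.shape[0]
    # Scaled dot-product score of every region against the audio feature.
    scores = visual_regions @ audio / np.sqrt(d)
    # Numerically stable softmax over the R regions.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Attended visual feature: weighted sum of the region features.
    attended = weights @ visual_regions
    return weights, attended

audio = np.array([1.0, 0.0])
regions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
weights, attended = audio_guided_visual_attention(audio, regions)
```

In this toy input the first region points in the same direction as the audio feature, so it receives the largest attention weight.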


Past and Future Motion Guided Network for Audio Visual Event Localization

Chen, Yin, Jin

2022

Preprint


In recent years, audio-visual event localization has attracted much attention. Its purpose is to detect the segment containing audio-visual events and recognize the event category from untrimmed videos. Existing methods use audio-guided visual attention to lead the model to pay attention to the spatial area of the ongoing event, focusing on the correlation between audio and visual information but ignoring the correlation between audio and spatial motion. We propose a past and future motion extraction (pf-ME) module to mine the visual motion from videos, embedded into the past and future motion guided network (PFAGN), and a motion guided audio attention (MGAA) module to focus on the information related to interesting events in the audio modality through the past and future visual motion. We choose AVE as the experimental verification dataset, and the experiments show that our method outperforms state-of-the-art methods in both supervised and weakly-supervised settings.



“…(Lin and Wang 2020) devise an Audiovisual Transformer to use audio as the guiding modality to refine visual features by performing spatial attention on contextual frames and instance attention to locate the sound-source within a frame. The Positive Sample Propagation module by (Zhou et al 2021) calculates similarity matrices between audio and visual features of different segments and thresholds them to eliminate insignificant audio-visual pairs. These matrices are used to co-refine similar segments together before fusing the modality information and learning temporal dependencies using LSTMs.…”

Section: Related Work (mentioning)

confidence: 99%
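
The similarity-and-threshold step described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes segment-level features of equal dimension, a scaled dot-product similarity, ReLU plus row normalization, and a hypothetical `threshold` value; the learned projections, fusion, and LSTM stages are omitted.

```python
import numpy as np

def thresholded_weights(sim, threshold):
    """Row-normalize a similarity matrix and drop weak pairs (illustrative)."""
    sim = np.maximum(sim, 0.0)  # keep only positive similarities
    row_sum = sim.sum(axis=1, keepdims=True)
    w = np.divide(sim, row_sum, out=np.zeros_like(sim), where=row_sum > 0)
    w = np.where(w > threshold, w, 0.0)  # discard insignificant pairs
    row_sum = w.sum(axis=1, keepdims=True)
    # Renormalize so the surviving weights of each row sum to 1.
    return np.divide(w, row_sum, out=np.zeros_like(w), where=row_sum > 0)

def psp_sketch(audio, visual, threshold=0.1):
    """audio, visual: (T, d) features for T segments (assumed shapes)."""
    d = audio.shape[1]
    sim = audio @ visual.T / np.sqrt(d)  # (T, T) audio-to-visual similarity
    w_av = thresholded_weights(sim, threshold)    # audio attends to visual
    w_va = thresholded_weights(sim.T, threshold)  # visual attends to audio
    # Co-refine: each modality adds the other modality's features,
    # aggregated over the segments that survived thresholding.
    audio_refined = audio + w_av @ visual
    visual_refined = visual + w_va @ audio
    return w_av, w_va, audio_refined, visual_refined

rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 8))
visual = rng.normal(size=(5, 8))
w_av, w_va, audio_refined, visual_refined = psp_sketch(audio, visual)
```

Each row of `w_av` either sums to 1 or is all zeros (when a segment has no positively correlated counterpart), so a segment only receives information from segments judged similar to it.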

Decompose the Sounds and Pixels, Recompose the Events

Rao, Khalil, et al. 2022

AAAI


In this paper, we propose a framework centering around a novel architecture called the Event Decomposition Recomposition Network (EDRNet) to tackle the Audio-Visual Event (AVE) localization problem in the supervised and weakly supervised settings. AVEs in the real world exhibit common unraveling patterns (termed Event Progress Checkpoints (EPC)), which humans can perceive through the cooperation of their auditory and visual senses. Unlike earlier methods, which attempt to recognize entire event sequences, the EDRNet models EPCs and inter-EPC relationships using stacked temporal convolutions. Based on the postulation that EPC representations are theoretically consistent for an event category, we introduce the State Machine Based Video Fusion, a novel augmentation technique that blends source videos using different EPC template sequences. Additionally, we design a new loss function called the Land-Shore-Sea loss to compactify continuous foreground and background representations. Lastly, to alleviate the issue of confusing events during weak supervision, we propose a prediction stabilization method called Bag to Instance Label Correction. Experiments on the AVE dataset show that our collective framework outperforms the state-of-the-art by a sizable margin.


“…By integrating the audio and visual information in multimodal scenes, it is expected to explore more sufficient scene information and overcome the limited perception in single modality. Recently, there have been several works utilizing audio and visual modality to facilitate multimodal scene understanding in different perspectives, such as sound source localization [23,31,34,37,48] and separation [10,13,41,59,61,63], audio inpainting [62], event localization [4,43,64], action recognition [14], video parsing [42,47], captioning [24,40,50], and dialog [1,66].…”

Section: Audio-visual Learning (mentioning)

confidence: 99%

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Li, Wei, Tian, et al. 2022

Preprint


In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-, V-, and AVQA approaches. We believe that our built dataset has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewulab.github.io/MUSIC-AVQA/
