Deep Audio-Visual Learning: A Survey

Zhu, Hao; Luo, Mandi; Wang, Rui; Zheng, Aihua; He, Ran

doi:10.48550/arxiv.2001.04758

Cited by 14 publications

(12 citation statements)

References 129 publications

“…They record the activities from different aspects, and cooperate together to help the viewer understand the video content. Recently, multimodal learning has proved that audio and vision modalities share a consistency space, and there are semantic relations between them [19], [20]. Lots of relevant video analysis tasks have demonstrated that the performance is promoted by utilizing the multimodal information in previous single modality tasks [21]-[23].…”

Section: A Motivation and Overview (mentioning)

confidence: 99%

AudioVisual Video Summarization

Zhao, Gong. 2021. Preprint.

Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, existing approaches exploit only the visual information and neglect the audio information. In this paper, we argue that the audio modality can help the vision modality better understand the video content and structure, and thereby benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN consists of three parts: 1) a two-stream LSTM encodes the audio and visual features sequentially, capturing their temporal dependencies; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, SumMe and TVsum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.

“…Lastly, to enhance the quality of the coarse outputs and obtain fine-grained results, authors provided two-stage GAN network in [47]. For a detailed review of audio-image translation tasks, please refer to a recent survey [50].…”

Section: Related Work (mentioning)

confidence: 99%

Ear2Face: Deep Biometric Modality Mapping

Yaman, Eyiokur, Ekenel. 2020. Preprint.

In this paper, we explore the correlation between different visual biometric modalities. For this purpose, we present an end-to-end deep neural network model that learns a mapping between the biometric modalities. Namely, our goal is to generate a frontal face image of a subject given his/her ear image as the input. We formulated the problem as a paired image-to-image translation task and collected datasets of ear and face image pairs from the Multi-PIE and FERET datasets to train our GAN-based models. We employed feature reconstruction and style reconstruction losses in addition to adversarial and pixel losses. We evaluated the proposed method both in terms of reconstruction quality and in terms of person identification accuracy. To assess the generalization capability of the learned mapping models, we also run cross-dataset experiments. That is, we trained the model on the FERET dataset and tested it on the Multi-PIE dataset and vice versa. We have achieved very promising results, especially on the FERET dataset, generating visually appealing face images from ear image inputs. Moreover, we attained a very high cross-modality person identification performance, for example, reaching 90.9% Rank-10 identification accuracy on the FERET dataset.

“…The joint learning of both audio and visual information has received growing attention in recent years [53,19,15,35,23]. By leveraging data within the two modalities, researchers have shown success in learning audio-visual self-supervision [4,2,3,25,31,22], audio-visual speech recognition [21,39,48,45], localization [47,38,37,34], event localization (parsing) [41,43,40], audio-visual navigation [13,5], cross-modality generation between the two modalities [9,51,8,6,48,7,52,49,42,50] and so on.…”

Section: Joint Audio-visual Learning (mentioning)

confidence: 99%

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

Zhou, Lin, et al. 2020. Preprint.

Stereophonic audio is an indispensable ingredient to enhance human auditory experience. Recent research has explored the usage of visual information as guidance to generate binaural or ambisonic audio from mono ones with stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: the recording of stereophonic audio usually requires delicate devices that are expensive for wide accessibility. To overcome this challenge, we propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio. Our key observation is that the task of visually indicated audio separation also maps independent audios to their corresponding visual positions, which shares a similar objective with stereophonic audio generation. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization. Specifically, a novel associative pyramid network architecture is carefully designed for audio-visual feature fusion. Extensive experiments demonstrate that our framework can improve the stereophonic audio generation results while performing accurate sound separation with a shared backbone.
