Interspeech 2018
DOI: 10.21437/interspeech.2018-1400

The Conversation: Deep Audio-Visual Speech Enhancement

Abstract: Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during traini…
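The abstract describes a network that predicts both the magnitude and the phase of the target speaker's spectrogram and recombines them into an enhanced signal. The following is a minimal sketch of that recombination step only, not the paper's network: the "predicted" magnitude and phase here are faked with an oracle computed from a known toy target, and all names (`pred_magnitude`, `pred_phase`, the 440 Hz/1 kHz toy signals) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

# Toy mixture: two sinusoids standing in for two simultaneous talkers.
fs = 16000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

f, frames, S_mix = stft(mixture, fs=fs, nperseg=512)

# In the paper's setting, a network predicts the target's magnitude and phase
# from the mixture plus the speaker's lip region; here both are faked with an
# oracle computed from the known 440 Hz "target" (illustration only).
target = np.sin(2 * np.pi * 440 * t)
_, _, S_tgt = stft(target, fs=fs, nperseg=512)
pred_magnitude = np.abs(S_tgt)    # stands in for the magnitude subnetwork
pred_phase = np.angle(S_tgt)      # stands in for the phase subnetwork

# Combine predicted magnitude and phase into a complex spectrogram, then
# invert it back to a time-domain enhanced waveform.
S_hat = pred_magnitude * np.exp(1j * pred_phase)
_, enhanced = istft(S_hat, fs=fs, nperseg=512)
```

With oracle predictions, the STFT/ISTFT round trip reconstructs the target waveform almost exactly; a real system's quality is bounded by how well the two subnetworks approximate these quantities.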

Cited by 283 publications (330 citation statements)
References 40 publications
“…Recent research on leveraging visual modality has led to impressive results in speech separation. In these studies, various representations of the visual information such as lip appearance [17,16] and optical flow [18,19] are used to estimate the time-frequency (TF) mask. In this paper, the audio-visual speech separation component used in the pipelined system is based on our previous work in [21].…”
Section: Audio-visual Speech Separation
confidence: 99%
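The statement above says visual features are used to estimate a time-frequency (TF) mask. A minimal sketch of what such a mask does, assuming an ideal ratio mask (IRM) as the training target and random toy magnitudes in place of real spectrograms (`S_target`, `S_noise`, and the additive-magnitude mixture are all illustrative assumptions, not the cited systems):

```python
import numpy as np

# Hypothetical toy magnitude spectrograms: target speech and interference.
rng = np.random.default_rng(0)
S_target = rng.random((257, 100))   # |STFT| of the target talker
S_noise = rng.random((257, 100))    # |STFT| of the competing talker
S_mixture = S_target + S_noise      # crude additive-magnitude mixture

# Ideal ratio mask (IRM): one common TF-mask target that such networks are
# trained to approximate from visual cues (lip appearance, optical flow).
irm = S_target**2 / (S_target**2 + S_noise**2 + 1e-8)

# Applying the mask to the mixture attenuates interference-dominated bins
# while passing target-dominated bins nearly unchanged.
S_enhanced = irm * S_mixture
```

The mask lies in [0, 1] per TF bin, so enhancement is a bin-wise reweighting of the mixture rather than a resynthesis from scratch.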
“…Motivated by the bimodal nature of human speech perception [2,10], and the invariance of visual information to acoustic signal corruption, audio-visual speech recognition (AVSR) technologies [11,12,13,14] can also be used for overlapped speech separation [15,16,17,18,19,20,21,22] and the back-end recognition component. However, the use of visual modality in the recognition stage of system development for overlapped speech remains limited to date.…”
Section: Introduction
confidence: 99%
“…The bottleneck layer activations were used as visual deep features. We also tried using the visual features in [15,16]; however, we found that these were less effective due to the unavailability of the training data used in [15,16]. The network architectures were the same as the one in the proposed model with E2EASR features.…”
Section: Baselines
confidence: 99%
“…Audio-visual speech enhancement methods also incorporate the visual information (video frames) associated with the noisy speech, aiming to improve the quality of the enhanced speech signal [11][12][13]. Using the video modality is…”
Section: Introduction
confidence: 99%