Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

Donahue, Chris; Li, Bo; Prabhavalkar, Rohit

doi:10.1109/icassp.2018.8462581

Cited by 186 publications

(142 citation statements)

References 30 publications

(43 reference statements)

Supporting

Mentioning

124

Contrasting


“…GAN-based training for SE [14] has received increased attention. GAN is employed to constrain the estimated signals close to the clean signals, which was shown to improve objective and subjective SE criterion, but it did not contribute to improvement in terms of ASR [26].…”

Section: Related Work (mentioning)

confidence: 99%

Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network

Kinoshita

Ochiai

Delcroix

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly. However, ASR performance in noisy conditions of single-channel systems remains unsatisfactory. Indeed, most single-channel speech enhancement (SE) methods (denoising) have brought only limited performance gains over state-of-the-art ASR back-end trained on multicondition training data. Recently, there has been much research on neural network-based SE methods working in the time-domain showing levels of performance never attained before. However, it has not been established whether the high enhancement performance achieved by such time-domain approaches could be translated into ASR. In this paper, we show that a single-channel time-domain denoising approach can significantly improve ASR performance, providing more than 30 % relative word error reduction over a strong ASR back-end on the real evaluation data of the single-channel track of the CHiME-4 dataset. These positive results demonstrate that single-channel noise reduction can still improve ASR performance, which should open the door to more research in that direction.


“…Moreover, there are many other studies on auditory data which work on audio spectrograms and consider them as 2D images. For instance, Donahue et al [86] as well as Michelsanti, Tan et al [87] employ GAN on audio spectrograms for speech enhancement. Fan et al [88] propose a GAN for separating the singing voice from background music.…”

Section: Related Work (mentioning)

confidence: 99%

Probabilistic Forecasting of Sensory Data With Generative Adversarial Networks – ForGAN

et al. 2019


Time series forecasting is one of the challenging problems for humankind. Traditional forecasting methods using mean regression models have severe shortcomings in reflecting real-world fluctuations. While new probabilistic methods rush to rescue, they fight with technical difficulties like quantile crossing or selecting a prior distribution. To meld the different strengths of these fields while avoiding their weaknesses as well as to push the boundary of the state of the art, we introduce ForGAN, one-step-ahead probabilistic forecasting with generative adversarial networks. ForGAN utilizes the power of the conditional generative adversarial network to learn the data generating distribution and compute probabilistic forecasts from it. We argue how to evaluate ForGAN in opposition to regression methods. To investigate probabilistic forecasting of ForGAN, we create a new dataset and demonstrate our method's abilities on it. This dataset will be made publicly available for comparison. Furthermore, we test ForGAN on two publicly available datasets, namely Mackey-Glass dataset [1] and Internet traffic dataset (A5M) [2], where the impressive performance of ForGAN demonstrates its high capability in forecasting future values.


“…Intentional noise has been added to machine translation data [9,10]. Alternate methods for collecting large scale audio data include Generative Adversarial Networks [11] and manual recording [12].…”

Section: Spoken Question Answering Datasets (mentioning)

confidence: 99%

Mitigating Noisy Inputs for Question Answering

et al. 2019


Natural language processing systems are often downstream of unreliable inputs: machine translation, optical character recognition, or speech recognition. For instance, virtual assistants can only answer your questions after understanding your speech. We investigate and mitigate the effects of noise from Automatic Speech Recognition systems on two factoid Question Answering (QA) tasks. Integrating confidences into the model and forced decoding of unknown words are empirically shown to improve the accuracy of downstream neural QA systems. We create and train models on a synthetic corpus of over 500,000 noisy sentences and evaluate on two human corpora from Quizbowl and Jeopardy! competitions.
