Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech

Valentini-Botinhao, Cassia; Yamagishi, Junichi

doi:10.1109/taslp.2018.2828980

Cited by 39 publications

(18 citation statements)

References 36 publications

(55 reference statements)


“…A multi-speaker reverberant speech database [29] was used in our experiments. From the database, we used a reverberant subset of 28 speakers that contained 11,572 utterances and 18 reverberation types (9 rooms × 2 microphone positions).…”

Section: Data and Feature Configuration (mentioning)

confidence: 99%


Reverberation Modeling for Source-Filter-Based Neural Vocoder

Ai, Wang, Yamagishi et al. 2020

Interspeech 2020

Self Cite


This paper presents a reverberation module for source-filter-based neural vocoders that improves the performance of reverberant effect modeling. This module uses the output waveform of neural vocoders as an input and produces a reverberant waveform by convolving the input with a room impulse response (RIR). We propose two approaches to parameterizing and estimating the RIR. The first approach assumes a global time-invariant (GTI) RIR and directly learns the values of the RIR on a training dataset. The second approach assumes an utterance-level time-variant (UTV) RIR, which is invariant within one utterance but varies across utterances, and uses another neural network to predict the RIR values. We add the proposed reverberation module to the phase spectrum predictor (PSP) of a HiNet vocoder and jointly train the model. Experimental results demonstrate that the proposed module was helpful for modeling the reverberation effect and improving the perceived quality of generated reverberant speech. The UTV-RIR was shown to be more robust than the GTI-RIR to unknown reverberation conditions and achieved a perceptually better reverberation effect.
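The core operation the abstract describes, producing a reverberant waveform by convolving a dry waveform with an RIR, can be illustrated with a minimal sketch. This is not the paper's module: the function names `convolve` and `synthetic_rir` are hypothetical, and the exponentially decaying RIR is a toy stand-in for the learned GTI or predicted UTV RIR values.

```python
def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(rir):
            # Each input sample excites a scaled, delayed copy of the RIR.
            out[n + k] += x * h
    return out


def synthetic_rir(length, decay):
    """Toy RIR (assumption): direct path of 1.0 followed by a geometric tail."""
    return [decay ** k for k in range(length)]


dry = [1.0, 0.0, 0.0, 0.5]      # illustrative "vocoder output" samples
rir = synthetic_rir(3, 0.5)     # [1.0, 0.5, 0.25]
wet = convolve(dry, rir)        # reverberant waveform, len(dry) + len(rir) - 1 samples
print(wet)
```

In a trainable setting the RIR values would be parameters (GTI) or the output of a second network (UTV) rather than a fixed list, and the convolution would typically be done with an FFT for speed.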


“…We used an open source toolkit [33] to blindly estimate T60 from the reverberant speech. The T60 estimation errors were calculated as the difference between the estimated T60 and the ground-truth T60 (T60n) reported in the database paper [29].…”

Section: Objective Evaluation - T60 Comparisons - T60 Estimation Errors (mentioning)

confidence: 99%
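The error computation quoted above (estimated T60 minus ground-truth T60) can be sketched as follows. The function names and all numeric values are hypothetical placeholders, not figures from the database paper or from the cited toolkit; the sign convention and the mean-absolute summary are also assumptions made for illustration.

```python
def t60_errors(estimated, ground_truth):
    """Signed error per condition, in seconds: estimated T60 minus ground-truth T60."""
    if len(estimated) != len(ground_truth):
        raise ValueError("need one ground-truth T60 per estimate")
    return [e - g for e, g in zip(estimated, ground_truth)]


def mean_absolute_error(errors):
    """Optional summary statistic (assumption): average magnitude of the errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical T60 values in seconds for three reverberation conditions.
estimated = [0.5, 0.75, 1.0]
ground_truth = [0.25, 1.0, 1.0]

errors = t60_errors(estimated, ground_truth)
mae = mean_absolute_error(errors)
print(errors, mae)
```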


“…However, their success is limited at mid SNR values. Botinhao et al. [8] recently proposed an SE technique for noise-robust speech synthesis based on recurrent networks. However, this technique operates in the feature domain instead of the waveform domain, resulting in the implicit introduction of vocoding quality in the enhanced speech.…”

Section: Introduction (mentioning)

confidence: 99%

Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN

et al. 2019


The quality of speech synthesis systems can be significantly deteriorated by the presence of background noise in the recordings. Despite the existence of speech enhancement techniques for effectively suppressing additive noise under low signal-to-noise ratio (SNR) conditions, these techniques have been neither designed nor tested in speech synthesis tasks where background noise has relatively lower energy. In this paper, we propose a speech enhancement technique based on generative adversarial networks (GANs) which acts as a preprocessing step of speech synthesis. Motivated by the speech enhancement generative adversarial network (SEGAN) approach and recent advances in deep learning, we propose to apply Wasserstein GAN (WGAN) with gradient penalty and gated activation functions to the autoencoder network of SEGAN. We studied the impact of the proposed method on a data set consisting of 28 speakers and different noise types at 3 different SNR levels. The effectiveness of the proposed method in the context of speech synthesis is demonstrated through the training of a WaveNet vocoder. We compare our method against SEGAN. Both subjective and objective metrics confirm that the proposed speech enhancement approach outperforms SEGAN in terms of speech synthesis quality.


“…The biggest challenge in building personalized TTS systems is to obtain a high quality training corpus from a particular voice to either build a speaker-dependent model or a speaker-adapted model using a pre-trained base model [4,5]. In any case, the quality of synthetic voices is highly affected by the presence of noise and reverberation in the training corpus [6,7]. One alternative is to identify and discard corrupted data, but this solution is only feasible when a large amount of training data is available, which is not the typical case in TTS personalization [8].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, there are not many studies about the effects of noise, reverberation, and the application of speech enhancement techniques for TTS. The most detailed study we found in the literature is [7], in which the authors evaluate the effects of noise and reverberation on a speaker-adapted TTS system and propose a TF masking method based on a Deep-Neural Network (DNN) to enhance the training data. The objective of this paper is to perform a thorough assessment of how noise and reverberation affect the different statistical models that compose the TTS system or are involved in its training: the Forced-Aligner (FA), the Acoustic Model (AM), and the Duration Model (DM).…”

Section: Introduction (mentioning)

confidence: 99%

Investigating the Effects of Noisy and Reverberant Speech in Text-to-Speech Systems

Ayllón, Sánchez-Hevia, Figueroa et al. 2019

Interspeech 2019


The quality of the voices synthesized by a Text-to-Speech (TTS) system depends on the quality of the training data. In a real-world scenario of TTS personalization from a user's voice recordings, the recordings are usually affected by noise and reverberation. Speech enhancement can be useful to clean the corrupted speech, but it is necessary to understand the effects that noise and reverberation have on the different statistical models that compose the TTS system. In this work we perform a thorough study of how noise and reverberation impact the acoustic and duration models of the TTS system. We also evaluate the effectiveness of time-frequency masking for cleaning the training data. Objective and subjective evaluations reveal that under normal recording scenarios noise leads to a higher degradation than reverberation in terms of naturalness of the synthesized speech.
