Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.564

Revisiting Over-Smoothness in Text to Speech

Ren et al.

Abstract: Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions…

Cited by 28 publications (15 citation statements)
References 27 publications

“…There have been multiple approaches for augmenting E2E models and training procedures to incorporate unpaired text data. Broadly speaking, these approaches use some combination of an LM trained on text data (shallow, cold, deep fusion [10,11,12,13]) and a multi-stage training procedure that incorporates unpaired data ("weak distillation" [14], "backtranslation" [15], "cycle-consistency" [16,17,18]). Each approach produces improvements in performance, but also increases some combination of model size, training and inference complexity, making it less desirable for on-device applications.…”
Section: Introduction
confidence: 99%
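For context on the shallow-fusion technique named in this excerpt: at each decoding step the E2E model's token log-probability is interpolated with an external language model's log-probability. The following is a minimal, hypothetical sketch; the function name, token dictionaries, and weight lam are illustrative and not taken from the cited work:

import math
from typing import Dict

def shallow_fusion_score(asr_logprobs: Dict[str, float],
                         lm_logprobs: Dict[str, float],
                         lam: float = 0.3) -> Dict[str, float]:
    # Combine per-token scores: log p_asr(y|x) + lam * log p_lm(y).
    return {tok: asr_logprobs[tok] + lam * lm_logprobs.get(tok, -math.inf)
            for tok in asr_logprobs}

# Example: re-rank candidate next tokens during beam-search expansion.
asr = {"cat": -0.9, "cap": -1.2}
lm = {"cat": -0.5, "cap": -2.3}
scores = shallow_fusion_score(asr, lm)
best = max(scores, key=scores.get)  # "cat": -1.05 beats "cap": -1.89

This only touches decoding scores, which is why shallow fusion adds no training complexity but does add an extra model at inference time, the trade-off the excerpt points out.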
“…During the inference stage, we take samples z from the prior distribution and feed them into the post-net in reverse to generate the final mel-spectrogram. As proved in (Ren et al. 2022), this flow-based module enhances the capability of modelling complex data distributions, which helps to address the one-to-many mapping problem.…”
Section: Post-net
confidence: 92%
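A rough illustration of this reverse-flow sampling step, assuming an affine-coupling flow conditioned on the coarse mel-spectrogram; the layer widths, conditioning scheme, and class names below are assumptions for the sketch, not the cited model's exact design:

import torch

class AffineCoupling(torch.nn.Module):
    """One invertible coupling step: half of the channels predict an affine
    transform (log-scale and shift) for the other half, given conditioning."""
    def __init__(self, dim: int = 80, cond_dim: int = 256):
        super().__init__()
        # Maps (first half of channels + conditioning) to log-scale and shift.
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim // 2 + cond_dim, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, dim),
        )

    def inverse(self, y: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        y_a, y_b = y.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([y_a, cond], dim=-1)).chunk(2, dim=-1)
        x_b = (y_b - t) * torch.exp(-log_s)  # undo the forward affine transform
        return torch.cat([y_a, x_b], dim=-1)

def sample_mel(flows, cond, mel_dim: int = 80):
    """Inference: draw z ~ N(0, I) and run it backwards through the flow,
    conditioned on the coarse mel-spectrogram from the decoder."""
    z = torch.randn(cond.size(0), cond.size(1), mel_dim)
    for flow in reversed(flows):
        z = flow.inverse(z, cond)
    return z  # final mel-spectrogram sample

# Example: 4 coupling layers, batch of 2, 100 frames, 256-dim conditioning.
flows = [AffineCoupling() for _ in range(4)]
mel = sample_mel(flows, cond=torch.randn(2, 100, 256))

Because every coupling step is invertible, the same module can be trained by maximum likelihood in the forward direction and then run in reverse at inference, which is what "feed them into the post-net in reverse" refers to.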
“…where Y_j denotes the j-th frame of the ground-truth mel-spectrogram with length T_m, and Ŷ_j stands for the j-th frame of the predicted mel-spectrogram. Note that the variance predictors simplify the acoustic target distribution by providing conditional information, thereby mitigating the one-to-many mapping issue (Ren et al. 2022). We analyse the effect of variance information in our experiment section.…”
Section: Linguistic Predictor
confidence: 99%
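The loss expression that the "where" clause above refers to appears to have been dropped during extraction. A plausible reconstruction, assuming the frame-wise L1 reconstruction loss common in FastSpeech-style models (an assumption, not necessarily the cited paper's exact objective):

% Assumed frame-wise L1 mel-spectrogram loss; the cited paper may use L2 or a weighted combination.
\mathcal{L}_{\mathrm{mel}} = \frac{1}{T_m} \sum_{j=1}^{T_m} \bigl\lVert Y_j - \hat{Y}_j \bigr\rVert_1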
“…However, there are still many challenges and opportunities in this domain [11], particularly when it comes to exploiting large amounts of data. On the speech generation side, one of the main difficulties is to build a model that correctly aligns the phonetic and acoustic sequences, leading to a natural prosody with fluent speech and high intelligibility, while still capturing the prosody variations [25]. On the opposite side, automatic speech recognition systems struggle with long-tail word recognition [35], and speech vs background disentanglement [18].…”
Section: Introduction
confidence: 99%