Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings

Tatanov, Oktai; Beliaev, Stanislav; Ginsburg, Boris

doi:10.1109/icassp43922.2022.9746107

Cited by 5 publications

(5 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…15.87 0.57% Tacotron2 [3] 16.20 0.56% MixerTTS [6] 10.29 0.87% LightSpeech [10] 0.76 11.84% us to get the overall picture of our model performance as a function of memory, computational budget and time [24] instead of focusing only on selected favorable metrics.…”

Section: Resultsmentioning

confidence: 99%

“…The 30.81 0.86% Tacotron2 [3] 23.81 1.12% MixerTTS [6] 20.06 1.33% LightSpeech [10] 1.80 14.78% tures. The fused features are then up sampled to the correct mel sequence length M using the predicted Duration:…”

Section: Model Architecturementioning

confidence: 99%

“…In terms of natural sounding voice generation, neural TTS systems such as FastSpeech2 [1], FastPitch [2], Tacotron2 [3], Deep Voice 3 [4], TransformerTTS [5] and Mixer-TTS [6] dominate the state of the art performance in MOS scores.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

EfficientSpeech: An On-Device Text to Speech Model

Atienza

2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices. These models are characterized by large memory footprints and substantial number of operations due to the long-standing focus on speech quality with cloud inference in mind. Neural TTS models are generally not designed to perform standalone speech syntheses on resource-constrained and no Internet access edge devices. In this work, an efficient neural TTS called EfficientSpeech that synthesizes speech on an ARM CPU in real-time is proposed. EfficientSpeech uses a shallow non-autoregressive pyramid-structure transformer forming a U-Network. EfficientSpeech has 266k parameters and consumes 90 MFLOPS only or about 1% of the size and amount of computation in modern compact models such as Mixer-TTS. EfficientSpeech achieves an average mel generation real-time factor of 104.3 on an RPi4. Human evaluation shows only a slight degradation in audio quality as compared to FastSpeech2.

show abstract

Section: Resultsmentioning

confidence: 99%

Section: Model Architecturementioning

confidence: 99%

See 1 more Smart Citation

EfficientSpeech: An On-Device Text to Speech Model

Atienza

2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

show abstract

“…Other relevant methods of heteronym resolution and verification include the morphological rewriting rules [12] and the context-dependent phone-based HMMs that use acoustic features [13]. [14] skips the phoneme representation in lieu of passing graphemes into a language model to generate its text representation. We plan to add these to our paper to address this broader context.…”

Section: Introductionmentioning

confidence: 99%

Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners

Huang¹,

Bakhturina²,

Tatanov³

2023

Preprint

View full text Add to dashboard Cite

Grapheme-to-phoneme (G2P) transduction is part of the standard text-to-speech (TTS) pipeline. However, G2P conversion is difficult for languages that contain heteronyms -words that have one spelling but can be pronounced in multiple ways. G2P datasets with annotated heteronyms are limited in size and expensive to create, as human labeling remains the primary method for heteronym disambiguation. We propose a RAD-TTS Aligner-based pipeline to automatically disambiguate heteronyms in datasets that contain both audio with text transcripts. The best pronunciation can be chosen by generating all possible candidates for each heteronym and scoring them with an Aligner model. The resulting labels can be used to create training datasets for use in both multi-stage and end-to-end G2P systems.

show abstract

“…Hwang et al (2021); Song et al (2022); Lajszczak et al (2022) claimed that the performance of NAR-TTS is poor when the training data is insufficient, devising effective data augmentation methods. Kim, Kong, and Son (2021) Tatanov, Beliaev, and Ginsburg (2022) boosted the expressiveness of speech by applying various methods proposed in the field of natural language processing (NLP) to the speech domain. Especially, GraphSpeech (Liu, Sisman, and Li 2021) and Relational Gated Graph Network (RGGN) (Zhou et al 2022) claimed the syntactic and semantic information of text affects the naturalness and expressiveness of speech.…”

Section: Introductionmentioning

confidence: 99%

RWEN-TTS: Relation-Aware Word Encoding Network for Natural Text-to-Speech Synthesis

Oh¹,

Noh²,

Hong³

et al. 2023

AAAI

View full text Add to dashboard Cite

With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose Relation-aware Word Encoding Network (RWEN), which effectively allows syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.

show abstract

Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings

Cited by 5 publications

References 5 publications

EfficientSpeech: An On-Device Text to Speech Model

EfficientSpeech: An On-Device Text to Speech Model

Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners

RWEN-TTS: Relation-Aware Word Encoding Network for Natural Text-to-Speech Synthesis

Contact Info

Product

Resources

About