ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
DOI: 10.1109/icassp43922.2022.9746107

Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings

Cited by 5 publications (5 citation statements)
References 5 publications

“…[per-model figures for several models, including Tacotron2 [3], MixerTTS [6], and LightSpeech [10], omitted] …us to get the overall picture of our model performance as a function of memory, computational budget and time [24] instead of focusing only on selected favorable metrics.…”
Section: Results
Mentioning confidence: 99%
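
The excerpt above argues for reporting memory, compute, and time together rather than a single favorable metric. A minimal, hypothetical sketch of that kind of profiling in PyTorch (the model and inputs are placeholders, not taken from the cited paper) might look like:

```python
import time
import torch

def profile_model(model, example_input, n_runs=20):
    """Report parameter count (memory proxy) and mean forward latency (time proxy)."""
    model.eval()
    n_params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        # Warm-up run so one-time initializations don't skew the timing.
        model(example_input)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(example_input)
        latency = (time.perf_counter() - start) / n_runs
    return n_params, latency

# Usage with any torch.nn.Module, e.g. a hypothetical acoustic model:
# params, sec = profile_model(acoustic_model, token_ids)
# print(f"{params / 1e6:.2f} M parameters, {sec * 1e3:.1f} ms per forward pass")
```
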
“…[per-model figures for several models, including Tacotron2 [3], MixerTTS [6], and LightSpeech [10], omitted] …The fused features are then upsampled to the correct mel sequence length M using the predicted durations:…”
Section: Model Architecture
Mentioning confidence: 99%
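
The upsampling step quoted above is the standard duration-based length regulation used in non-autoregressive TTS: each phoneme-level feature vector is repeated for as many mel frames as its predicted duration. A minimal sketch, with illustrative tensor names rather than the cited paper's code:

```python
import torch

def length_regulate(encoded, durations):
    """Upsample phoneme-level features to mel-frame length using predicted durations.

    encoded:   (num_phonemes, hidden_dim) phoneme/character features
    durations: (num_phonemes,) integer number of mel frames per phoneme
    returns:   (sum(durations), hidden_dim) frame-level features
    """
    # Repeat each phoneme vector as many times as its predicted duration.
    return torch.repeat_interleave(encoded, durations, dim=0)

# Example: 3 phonemes upsampled to 2 + 4 + 1 = 7 mel frames.
feats = torch.randn(3, 8)
durs = torch.tensor([2, 4, 1])
frames = length_regulate(feats, durs)
print(frames.shape)  # torch.Size([7, 8])
```
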
“…Other relevant methods of heteronym resolution and verification include morphological rewriting rules [12] and context-dependent phone-based HMMs that use acoustic features [13]. [14] skips the phoneme representation entirely, instead passing graphemes into a language model to generate its text representation. We plan to add these to our paper to address this broader context.…”
Section: Introduction
Mentioning confidence: 99%
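
The approach attributed to [14], and reflected in Mixer-TTS's conditioning on language model embeddings, feeds raw graphemes to a pretrained language model and uses its token-level hidden states as an additional text representation. A minimal sketch using Hugging Face transformers (the checkpoint name is only an example, not necessarily the one used in the paper):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Example checkpoint; any pretrained LM exposing token-level hidden states works the same way.
lm_name = "albert-base-v2"
tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModel.from_pretrained(lm_name).eval()

def grapheme_embeddings(text):
    """Return token-level LM embeddings for raw text, with no phoneme conversion."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = lm(**inputs)
    # (1, num_tokens, hidden_dim) contextual embeddings of the grapheme/subword tokens.
    return outputs.last_hidden_state

emb = grapheme_embeddings("He read the book yesterday.")
print(emb.shape)
```
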
“…Hwang et al. (2021), Song et al. (2022), and Lajszczak et al. (2022) claimed that the performance of NAR-TTS is poor when the training data is insufficient, and devised effective data augmentation methods. Kim, Kong, and Son (2021) and Tatanov, Beliaev, and Ginsburg (2022) boosted the expressiveness of speech by applying various methods proposed in the field of natural language processing (NLP) to the speech domain. In particular, GraphSpeech (Liu, Sisman, and Li 2021) and the Relational Gated Graph Network (RGGN) (Zhou et al. 2022) claimed that the syntactic and semantic information of text affects the naturalness and expressiveness of speech.…”
Section: Introduction
Mentioning confidence: 99%