2023
DOI: 10.48550/arxiv.2301.02111
Preprint

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. …
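
As a rough sketch of the framing described in the abstract (illustrative only, not the released VALL-E model): phoneme tokens and discrete codec tokens are placed in one sequence and a decoder-only Transformer predicts the next codec token, so TTS becomes ordinary conditional language modeling. All class names, vocabulary sizes, and dimensions below are assumptions.

```python
# Minimal sketch: TTS as conditional language modeling over codec tokens.
# Hyperparameters and module layout are illustrative, not from the paper.
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_phonemes=256, n_codes=1024, d_model=512, n_layers=6):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d_model)
        self.code_emb = nn.Embedding(n_codes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, phonemes, codes):
        # Concatenate text and acoustic tokens into one sequence and apply a
        # causal mask, so generation is plain next-token prediction.
        x = torch.cat([self.phone_emb(phonemes), self.code_emb(codes)], dim=1)
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.head(h[:, phonemes.size(1):])  # logits over codec tokens

phonemes = torch.randint(0, 256, (1, 20))   # dummy phoneme ids
codes = torch.randint(0, 1024, (1, 80))     # dummy first-level codec tokens
logits = CodecLM()(phonemes, codes)         # shape (1, 80, 1024)
```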

Cited by 25 publications (56 citation statements). References 9 publications.

“…It would need to appropriately say answers for different scientific and mathematical textual representations. A recent model such as Microsoft's VALL-E which can simulate a person's voice could be used Wang et al (2023).…”
Section: Speech Output (mentioning, confidence: 99%)
“…Generative Models Incorporating other generative models in Ubiq-Genie, beyond the employed image and text synthesis models, could lead to interesting applications. Potential models to be integrated could be capable of synthesising 3D models from text or images such as Point-E [16], personalised speech from text such as VALL-E [24], or audio from text or images such as Make-An-Audio [9] and MusicLM [1]. In addition, the currently implemented services could be expanded to build more advanced types of applications and experiences.…”
Section: Services and Applications (mentioning, confidence: 99%)
“…VALL-E (Wang et al, 2023) instead relies on a hybrid approach, where the tokens corresponding to the first RVQ level are predicted autoregressively, and the subsequent levels are produced non-autoregressively. The latter is achieved by a model that sums up the embeddings from the same RVQ input frame, and applies bidirectional self-attention to predict all tokens from RVQ level q + 1 given all tokens from levels 1, …, q.…”
Section: Related Work (mentioning, confidence: 99%)
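
A minimal sketch of the non-autoregressive stage described in the quote above, assuming per-level embedding tables and a standard bidirectional Transformer encoder; module names, shapes, and sizes are illustrative and not taken from the paper.

```python
# Sketch of the NAR stage: embeddings of the known RVQ levels 1..q are summed
# per frame, a bidirectional Transformer attends over the frames (no causal
# mask), and a head predicts the tokens of level q+1 for all frames at once.
import torch
import torch.nn as nn

class NARLevelPredictor(nn.Module):
    def __init__(self, n_codes=1024, n_levels=8, d_model=512, n_layers=6):
        super().__init__()
        # One embedding table per RVQ level (a real system might share them).
        self.embs = nn.ModuleList([nn.Embedding(n_codes, d_model) for _ in range(n_levels)])
        self.level_emb = nn.Embedding(n_levels, d_model)  # which level to predict
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # bidirectional: no mask
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, codes, target_level):
        # codes: (batch, q, frames) tokens of the already-known levels 1..q
        x = sum(self.embs[i](codes[:, i]) for i in range(codes.size(1)))
        x = x + self.level_emb(torch.tensor([target_level]))  # broadcast over frames
        return self.head(self.encoder(x))  # logits for every frame of level q+1

codes = torch.randint(0, 1024, (1, 3, 80))            # levels 1..3, 80 frames
logits = NARLevelPredictor()(codes, target_level=3)   # shape (1, 80, 1024)
```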
“…Modeling discrete representations of audio produced by neural codecs (Zeghidour et al, 2022; Défossez et al, 2022) makes the task of audio generation amenable to the powerful Transformer-based sequence-to-sequence modeling approaches (Vaswani et al, 2017). Casting unconditional and conditional audio generation as sequence-to-sequence modeling has unlocked rapid progress in speech continuation (Borsos et al, 2022), text-to-speech (Wang et al, 2023; Kharitonov et al, 2023), and general audio and music generation (Kreuk et al, 2022; Agostinelli et al, 2023).…”
Section: Introduction (mentioning, confidence: 99%)
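
For context on the "discrete representations of audio produced by neural codecs" mentioned above, here is a bare-bones sketch of residual vector quantization with random codebooks; in a real codec such as SoundStream or EnCodec the codebooks are learned jointly with an encoder and decoder, so this is only an illustration of the mechanism.

```python
# Sketch of residual vector quantization: each level encodes the residual left
# by the previous levels, yielding one discrete token per level per frame.
import torch

def rvq_encode(frames, codebooks):
    """frames: (T, d) encoder outputs; codebooks: list of (K, d) tensors."""
    residual, codes = frames, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]                   # next level models what is left
    return torch.stack(codes)                           # (n_levels, T) discrete tokens

torch.manual_seed(0)
frames = torch.randn(80, 128)                           # 80 frames of 128-d features
codebooks = [torch.randn(1024, 128) for _ in range(8)]  # 8 levels, 1024 entries each
tokens = rvq_encode(frames, codebooks)                  # shape (8, 80)
```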