2023
DOI: 10.48550/arxiv.2301.02111
Preprint

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. …
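
As a rough sketch of the framing described in the abstract (illustrative only, not the released VALL-E model): phoneme tokens and discrete codec tokens are placed in one sequence and a decoder-only Transformer predicts the next codec token, so TTS becomes ordinary conditional language modeling. All class names, vocabulary sizes, and dimensions below are assumptions.

```python
# Minimal sketch: TTS as conditional language modeling over codec tokens.
# Hyperparameters and module layout are illustrative, not from the paper.
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_phonemes=256, n_codes=1024, d_model=512, n_layers=6):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d_model)
        self.code_emb = nn.Embedding(n_codes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, phonemes, codes):
        # Concatenate text and acoustic tokens into one sequence and apply a
        # causal mask, so generation is plain next-token prediction.
        x = torch.cat([self.phone_emb(phonemes), self.code_emb(codes)], dim=1)
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.head(h[:, phonemes.size(1):])  # logits over codec tokens

phonemes = torch.randint(0, 256, (1, 20))   # dummy phoneme ids
codes = torch.randint(0, 1024, (1, 80))     # dummy first-level codec tokens
logits = CodecLM()(phonemes, codes)         # shape (1, 80, 1024)
```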

Cited by 25 publications (56 citation statements). References 9 publications.

“…It would need to appropriately say answers for different scientific and mathematical textual representations. A recent model such as Microsoft's VALL-E which can simulate a person's voice could be used Wang et al (2023).…”
Section: Speech Output (mentioning, confidence: 99%)
“…Generative Models Incorporating other generative models in Ubiq-Genie, beyond the employed image and text synthesis models, could lead to interesting applications. Potential models to be integrated could be capable of synthesising 3D models from text or images such as Point-E [16], personalised speech from text such as VALL-E [24], or audio from text or images such as Make-An-Audio [9] and MusicLM [1]. In addition, the currently implemented services could be expanded to build more advanced types of applications and experiences.…”
Section: Services and Applications (mentioning, confidence: 99%)
“…VALL-E (Wang et al, 2023) instead relies on a hybrid approach, where the tokens corresponding to the first RVQ level are predicted autoregressively, and the subsequent levels are produced non-autoregressively. The latter is achieved by a model that sums up the embeddings from the same RVQ input frame, and applies bidirectional self-attention to predict all tokens from RVQ level q + 1 given all tokens from levels 1, …, q.…”
Section: Related Work (mentioning, confidence: 99%)
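
A minimal sketch of the non-autoregressive stage described in the quote above, assuming per-level embedding tables and a standard bidirectional Transformer encoder; module names, shapes, and sizes are illustrative and not taken from the paper.

```python
# Sketch of the NAR stage: embeddings of the known RVQ levels 1..q are summed
# per frame, a bidirectional Transformer attends over the frames (no causal
# mask), and a head predicts the tokens of level q+1 for all frames at once.
import torch
import torch.nn as nn

class NARLevelPredictor(nn.Module):
    def __init__(self, n_codes=1024, n_levels=8, d_model=512, n_layers=6):
        super().__init__()
        # One embedding table per RVQ level (a real system might share them).
        self.embs = nn.ModuleList([nn.Embedding(n_codes, d_model) for _ in range(n_levels)])
        self.level_emb = nn.Embedding(n_levels, d_model)  # which level to predict
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # bidirectional: no mask
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, codes, target_level):
        # codes: (batch, q, frames) tokens of the already-known levels 1..q
        x = sum(self.embs[i](codes[:, i]) for i in range(codes.size(1)))
        x = x + self.level_emb(torch.tensor([target_level]))  # broadcast over frames
        return self.head(self.encoder(x))  # logits for every frame of level q+1

codes = torch.randint(0, 1024, (1, 3, 80))            # levels 1..3, 80 frames
logits = NARLevelPredictor()(codes, target_level=3)   # shape (1, 80, 1024)
```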
“…Modeling discrete representations of audio produced by neural codecs (Zeghidour et al, 2022; Défossez et al, 2022) makes the task of audio generation amenable to the powerful Transformer-based sequence-to-sequence modeling approaches (Vaswani et al, 2017). Casting unconditional and conditional audio generation as sequence-to-sequence modeling has unlocked rapid progress in speech continuation (Borsos et al, 2022), text-to-speech (Wang et al, 2023; Kharitonov et al, 2023), and general audio and music generation (Kreuk et al, 2022; Agostinelli et al, 2023).…”
Section: Introduction (mentioning, confidence: 99%)
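
For context on the "discrete representations of audio produced by neural codecs" mentioned above, here is a bare-bones sketch of residual vector quantization with random codebooks; in a real codec such as SoundStream or EnCodec the codebooks are learned jointly with an encoder and decoder, so this is only an illustration of the mechanism.

```python
# Sketch of residual vector quantization: each level encodes the residual left
# by the previous levels, yielding one discrete token per level per frame.
import torch

def rvq_encode(frames, codebooks):
    """frames: (T, d) encoder outputs; codebooks: list of (K, d) tensors."""
    residual, codes = frames, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]                   # next level models what is left
    return torch.stack(codes)                           # (n_levels, T) discrete tokens

torch.manual_seed(0)
frames = torch.randn(80, 128)                           # 80 frames of 128-d features
codebooks = [torch.randn(1024, 128) for _ in range(8)]  # 8 levels, 1024 entries each
tokens = rvq_encode(frames, codebooks)                  # shape (8, 80)
```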