2018
DOI: 10.48550/arxiv.1807.07281
Preprint

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Abstract: In this work, we propose a new solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (van den Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural archite…
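
The closed-form KL divergence mentioned in the abstract is, at its core, the standard identity for two Gaussians. Below is a minimal sketch, assuming univariate Gaussians parameterized by mean and log scale; the function name is illustrative, and the regularized variant the abstract refers to adds a term that is not shown here.

```python
import math

def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL(q || p) for univariate Gaussians
    q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    sigma_q = math.exp(log_sigma_q)
    sigma_p = math.exp(log_sigma_p)
    return (log_sigma_p - log_sigma_q
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Example: a sharply peaked student prediction against a teacher prediction.
print(gaussian_kl(mu_q=0.0, log_sigma_q=-5.0, mu_p=0.1, log_sigma_p=-4.0))
```

Because both distributions are Gaussian, no Monte Carlo sampling is needed to estimate the divergence, which is what makes the distillation objective cheap to evaluate.
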

Cited by 111 publications (65 citation statements)
References 14 publications (16 reference statements)
“…ClariNet [69] is also a vocoder that employs knowledge distillation [36]. However, the training process with distillation-based methods remains problematic.…”
Section: Vocoders
confidence: 99%
“…Autoregressive: WaveNet [66], SampleRNN [57], DeepVoice [2], LPCNet [89]; Non-autoregressive: WaveGlow [72], FloWaveNet [41], WaveFlow [70], Parallel WaveNet [65], ClariNet [69], WaveGAN [20], Parallel WaveGAN [103], MelGAN [45], GAN-TTS [5], HiFi-GAN [44]; End-to-End: Char2Wav [87], Fastspeech 2s [80], EATs [21], VITS [40]. Figure 2: A taxonomy of TTS.…”
Section: Acoustic Models
confidence: 99%
“…The second stage is to synthesize the raw waveform audio from the predicted intermediate representation [17], [18], [19], [20], [21]. In order to simplify the TTS system in terms of training and deployment, end-to-end TTS models have been proposed [22], [23], [24]. However, for the talking head generation task, the intermediate representations of the two-stage approach are useful.…”
Section: A Text-to-speech Synthesis
confidence: 99%
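
The statement above contrasts two-stage pipelines with end-to-end (text-to-wave) models. A minimal sketch of the two-stage flow, assuming a hypothetical AcousticModel/Vocoder interface; the class names, shapes, and hop length below are illustrative assumptions, not any cited system's API.

```python
import numpy as np

class AcousticModel:
    """Stage 1: text -> intermediate representation (e.g. a mel spectrogram)."""
    def predict_mel(self, text: str) -> np.ndarray:
        # Placeholder for a real sequence-to-sequence acoustic model.
        return np.zeros((80, 200), dtype=np.float32)  # 80 mel bins x 200 frames

class Vocoder:
    """Stage 2: intermediate representation -> raw waveform."""
    def synthesize(self, mel: np.ndarray) -> np.ndarray:
        # Placeholder for a neural vocoder (WaveNet-style, flow-based, or GAN).
        hop_length = 256
        return np.zeros(mel.shape[1] * hop_length, dtype=np.float32)

mel = AcousticModel().predict_mel("Hello world")
wav = Vocoder().synthesize(mel)  # end-to-end models collapse these two stages
```

The intermediate mel representation is exactly what the quoted work wants to keep for talking-head generation, whereas end-to-end models trade it away for a simpler training and deployment story.
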
“…In addition, other recent AR models, including sampleRNN [17] and LPCNet [33] have further improved the sound quality. However, due to the large amount of computation and the slow generation speed, researchers currently mainly focus on developing non-AR wave generation models, such as Parallel WaveNet [20], ClariNet [21], GanSynth [5], FloWaveNet [11], MelGan [15], WaveGlow [24], Parallel WaveGan [37], and so on.…”
Section: Introduction
confidence: 99%