2021
DOI: 10.48550/arxiv.2110.02584
Preprint

EdiTTS: Score-based Editing for Controllable Text-to-Speech

Cited by 3 publications (4 citation statements)
References 7 publications
“…In a parallel development, Diffsound [76] employs a non-autoregressive decoder based on discrete diffusion models [79], a novel approach that allows for the simultaneous prediction of all mel-spectrogram tokens, followed by successive refinements. Another innovative approach is EdiTTS [56], which utilizes a score-based text-to-speech model to refine a mel-spectrogram that has been coarsely altered, ensuring greater control and precision in the generation process.…”
Section: Text-to-audio Generation (mentioning)
confidence: 99%
“…Subsequently, we will systematically review the applications of diffusion models across various fields (Section 5), including computer vision [11][12][13][14][15][16][17][18][35][36][37][38], multi-modal generation [10,19,20][39][40][41][42][43][44][45][46][47][48][49][50], and interdisciplinary fields [51][52][53][54][55][56]. For each application, we will define the task and discuss how diffusion models offer solutions to challenges identified in previous works.…”
Section: Introduction (mentioning)
confidence: 99%
“…Dilated convolution increases the receptive field without changing the size of the image output feature map. The operations of ordinary convolution and dilated convolution [44] are Eqs. (7) and (8):…”
Section: Dilated Convolution (mentioning)
confidence: 99%
“…Denoising diffusion models [60,64] have seen great success on a wide variety of different challenges, ranging from image2image translation tasks like inpainting, colorisation, image upscaling, uncropping [6,26,41,42,50,53,57,59], audio generation [11,28,33,35,38,48,67,80], text-based image generation [4,21,23,46,51,55,58], video generation [24,27,82,86], and many others. For a thorough review on diffusion models and all of their recent applications, we recommend [81].…”
Section: Diffusion Models (mentioning)
confidence: 99%