EdiTTS: Score-based Editing for Controllable Text-to-Speech

Tae, Jaesung; Kim, Hyeongju; Kim, Taesu

doi:10.48550/arxiv.2110.02584

Cited by 3 publications

(4 citation statements)

References 7 publications


“…In a parallel development, Diffsound [76] employs a non-autoregressive decoder based on discrete diffusion models [79], a novel approach that allows for the simultaneous prediction of all mel-spectrogram tokens, followed by successive refinements. Another innovative approach is EdiTTS [56], which utilizes a score-based text-to-speech model to refine a mel-spectrogram that has been coarsely altered, ensuring greater control and precision in the generation process.…”

Section: Text-to-audio Generation (mentioning)

confidence: 99%

“…Subsequently, we will systematically review the applications of diffusion models across various fields (Section 5), including computer vision [11–18, 35–38], multi-modal generation [10, 19, 20, 39–50], and interdisciplinary fields [51–56]. For each application, we will define the task and discuss how diffusion models offer solutions to challenges identified in previous works.…”

Section: Introduction (mentioning)

confidence: 99%


Artificial-Intelligence-Generated Content with Diffusion Models: A Literature Review

Wang, He, Peng. 2024. Mathematics.

Diffusion models have swiftly taken the lead in generative modeling, establishing unprecedented standards for producing high-quality, varied outputs. Unlike Generative Adversarial Networks (GANs)—once considered the gold standard in this realm—diffusion models bring several unique benefits to the table. They are renowned for generating outputs that more accurately reflect the complexity of real-world data, showcase a wider array of diversity, and are based on a training approach that is comparatively more straightforward and stable. This survey aims to offer an exhaustive overview of both the theoretical underpinnings and practical achievements of diffusion models. We explore and outline three core approaches to diffusion modeling: denoising diffusion probabilistic models, score-based generative models, and stochastic differential equations. Subsequently, we delineate the algorithmic enhancements of diffusion models across several pivotal areas. A notable aspect of this review is an in-depth analysis of leading generative models, examining how diffusion models relate to and evolve from previous generative methodologies, offering critical insights into their synergy. A comparative analysis of the merits and limitations of different generative models is a vital component of our discussion. Moreover, we highlight the applications of diffusion models across computer vision, multi-modal generation, and beyond, culminating in significant conclusions and suggesting promising avenues for future investigation.


“…Dilated convolution increases the receptive field without changing the size of the image output feature map. The operations of ordinary convolution and dilated convolution [44] are Eqs. (7) and (8):…”

Section: Dilated Convolution (mentioning)

confidence: 99%

Diffusion Model for Multi-scale Ship Object Detection and Recognition in Remote Sensing Images

Chen, Wang, Liu et al. 2024. Preprint.

Ship object detection and recognition in remote sensing images (RSIs) is a challenging task due to the multi-scale and complex background characteristics of ship objects. Currently, convolution-based methods cannot adequately solve these problems. Firstly, this article first applies the diffusion model to the task of ship object detection and recognition in RSIs, and proposes a new diffusion model for multi-scale ship object detection and recognition in remote sensing images (MSDiffDet). Secondly, in order to reduce the loss of multi-scale information in the feature extraction process, this article proposes the Channel Fusion FPN (CF-FPN) based on FPN and constructs the Large-Scale Feature Enhancement Module (LSFEM), which further enhances the algorithm's ability to extract large-scale ship object features and improves the detection accuracy of ship objects in RSIs. Finally, this article prunes and reconstructs MobileNetV2 to obtain the Sparse MobileNetV2, which is used as the backbone network of the image encoder, which enhances detection accuracy while reducing the overall parameter count of the algorithm. The experimental results demonstrate that the MSDiffDet algorithm is effective in detecting and recognizing four types of remote sensing ship objects: aircraft carriers, warships, commercial ships, and submarines. The mAP0.5 achieved a notable 89.8%. A significant improvement of 5.8% in mAP0.5 is observed compared to the DiffusionDet algorithm, indicating the potential of the MSDiffDet algorithm for applications in remote sensing ship object detection and recognition.


“…Denoising diffusion models [60,64] have seen great success on a wide variety of different challenges, ranging from image2image translation tasks like inpainting, colorisation, image upscaling, uncropping [6,26,41,42,50,53,57,59], audio generation [11,28,33,35,38,48,67,80], text-based image generation [4,21,23,46,51,55,58], video generation [24,27,82,86], and many others. For a thorough review on diffusion models and all of their recent applications, we recommend [81].…”

Section: Diffusion Models (mentioning)

confidence: 99%

Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Bigioi, Basak, Jordan et al. 2023. Preprint.

In this paper we propose a method for end-to-end speech driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model with audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, achieving a word error rate of 45% using an off the shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.
