BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Leng, Yichong; Chen, Zehua; Guo, Junliang; Liu, Haohe; Chen, Jiawei; Xu, Tao; Mandic, Danilo P.; Li, He; Li, Xiangyang; Qin, Tao; Zhao, S. J.; Liu, Tie-Yan

doi:10.48550/arxiv.2205.14807

Cited by 3 publications

(4 citation statements)

References 19 publications

(23 reference statements)


“…One major drawback of DPM-based models is the slow sampling speed due to many iterative steps. Therefore, many previous DPM-based TTS methods focus on accelerating the sampling method to boost the inference speed [38], [39], [56], [57], [58]. Some research considers changing the training process to generate high-quality speech.…”

Section: DPM-based TTS (mentioning)

confidence: 99%

DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech — A Study Between English and Mandarin

Li, Hu, Cong et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.


While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.

“…Denoising diffusion probabilistic models (DDPMs) (Ho et al, 2020) have demonstrated their great generation potential on various applications, such as text-to-image synthesis (Poole et al, 2022;Gu et al, 2022;Kim & Ye, 2021), image inpainting (Lugmayr et al, 2022;Liu et al, 2022;Kawar et al, 2022), speech synthesis (Huang et al, 2021;Lam et al, 2022;Leng et al, 2022), and molecular conformation generation (Hoogeboom et al, 2022;Jing et al, 2022;Wu et al, 2022;Huang et al, 2022). It involves a diffusion process to gradually add noise to data, and a parameterized denoising process to reverse the diffusion process, sampling through gradually removing the noise from random noise.…”

Section: Denoising Diffusion Probabilistic Models (mentioning)

confidence: 99%

“…In recent years, denoising diffusion probabilistic models (DDPMs) (Ho et al, 2020) have been proven to have potential in data generation tasks such as text-to-image generation (Poole et al, 2022;Gu et al, 2022;Kim & Ye, 2021;Chen et al, 2022), speech synthesis (Huang et al, 2021;Lam et al, 2022;Leng et al, 2022), and molecular conformation formation (Hoogeboom et al, 2022;Jing et al, 2022;Wu et al, 2022;Huang et al, 2022). They build a diffusion process to add noise into the sample and a denoising process to remove noise from the sample gradually.…”

Section: Introduction (mentioning)

confidence: 99%

ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models

Li, Liu, Chai et al. 2023

Preprint


Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous sampling methods with fixed analytical form are not robust to the error in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower error in the estimated noise. Experiments on CIFAR-10, LSUN-Church, and LSUN-Bedroom datasets demonstrate that our proposed ERA-Solver achieves 5.14, 9.42, and 9.69 Fréchet Inception Distance (FID) for image generation, with only 10 network evaluations.
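The citation statements above describe a diffusion process that gradually adds noise to data. As a rough illustration only, the sketch below samples the noised signal in closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, following the standard DDPM formulation of Ho et al. (2020). It is not code from BinauralGrad or ERA-Solver. The function names, the linear schedule endpoints (1e-4 to 0.02), the 1000 steps, and the toy sine waveform are all assumptions chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoint values are assumed)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is a 0-based step index."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples

# Toy "waveform": one period of a sine wave (hypothetical stand-in for audio).
x0 = [math.sin(2 * math.pi * i / 16) for i in range(16)]
x_mid = diffuse(x0, 499, alpha_bars, rng)  # partially noised
x_end = diffuse(x0, 999, alpha_bars, rng)  # almost pure Gaussian noise
```

Because ᾱ_t shrinks toward zero as t grows, the last step retains almost none of the original signal. The learned denoising process is what reverses this, one step at a time, and the number of those reverse steps is the sampling cost that fast solvers aim to reduce.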


“…In our approach, we decided that rather than relying on visual cues, allowing direct customization of the output 3D signal with an interactive interface would allow for more freedom and generate more satisfactory results. This paper presented a synthesis process in order to generate binaural audio from a mono audio source, also similar to what we are trying to accomplish [10]. Instead of utilizing head-related transfer functions, they created a novel process for generating binaural audio utilizing diffusion models.…”

Section: Related Work (mentioning)

confidence: 99%

An Context-Aware Intelligent System to Automate the Conversion of 2D Audio to 3D Audio using Signal Processing and Machine Learning

Gao, Sun 2022

Artificial Intelligence and Fuzzy Logic System


As virtual reality technologies emerge, the ability to create visually immersive experiences has drastically improved [1]. However, to accompany the visual immersion, audio must also become more immersive [2]. This is where 3D audio comes in. 3D audio allows for the simulation of sounds from specific directions, allowing a more realistic feeling [3]. At present, there is a lack of sufficient tools for users to design immersive audio experiences that fully exploit the abilities of 3D audio. This paper proposes and implements the following systems [4]: (1) automatic separation of stems from the incoming audio file, or letting the user upload the stems themselves; (2) a simulated environment in which the separated stems are automatically placed; and (3) a user interface for manipulating the simulated positions of the separated stems. We applied our application to a few selected audio files in order to conduct a qualitative evaluation of our approach. The results show that our approach was able to successfully separate the stems and simulate a dimensional sound effect.
