Yuan-Hao Yi scite author profile

Yuan-Hao Yi

5 Publications

48 Citation Statements Received

71 Citation Statements Given

Affiliations

Microsoft Research Asia (China), University of Science and Technology of China, Microsoft Research (United Kingdom)

Publications

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Ling et al. 2019

This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dependencies among the acoustic features of consecutive frames. For F0 modeling, discretized F0 values are used and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also designed to alleviate the inconsistency between the predicted F0 contours and the F0 values determined by music notes. Furthermore, we extend the DAR model to deal with continuous spectral features, and a prenet module with self-attention layers is introduced to process historical frames. Experiments on a Chinese singing voice corpus demonstrate that our method using DARs can produce F0 contours with vibratos effectively, and can achieve better objective and subjective performance than the conventional method using recurrent neural networks (RNNs).
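The discretized F0 representation and the fixed-length history mentioned in the abstract can be illustrated with a minimal sketch. The bin resolution, pitch range, unvoiced class, and the function names below are assumptions made for illustration only; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

# All constants are illustrative assumptions, not values from the paper.
BINS_PER_SEMITONE = 4        # assumed quantization resolution
MIN_MIDI, MAX_MIDI = 36, 84  # assumed pitch range (roughly C2 to C6)
UNVOICED = 0                 # class index reserved for unvoiced frames


def hz_to_class(f0_hz):
    """Map an F0 value in Hz to a discrete class index (0 means unvoiced)."""
    if f0_hz <= 0:
        return UNVOICED
    midi = 69.0 + 12.0 * math.log2(f0_hz / 440.0)
    midi = min(max(midi, MIN_MIDI), MAX_MIDI)  # clip to the assumed range
    return 1 + round((midi - MIN_MIDI) * BINS_PER_SEMITONE)


def class_to_hz(index):
    """Invert hz_to_class up to quantization error."""
    if index == UNVOICED:
        return 0.0
    midi = MIN_MIDI + (index - 1) / BINS_PER_SEMITONE
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)


def history_pairs(classes, history_len):
    """Build (history, target) pairs for an autoregressive predictor.

    Each target frame is paired with the `history_len` preceding frames;
    the start of the sequence is left-padded with the unvoiced class.
    """
    padded = [UNVOICED] * history_len + list(classes)
    return [
        (tuple(padded[i:i + history_len]), padded[i + history_len])
        for i in range(len(classes))
    ]
```

Varying `history_len` in `history_pairs` is one simple way to set up the kind of history-length comparison the abstract refers to; the model that consumes these pairs is left out here.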

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Xu, Chen, Liu et al. 2022

Preprint

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. This naturally raises several questions: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves −0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p ≫ 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
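The evaluation described in the abstract pairs a CMOS value with a Wilcoxon signed-rank test. A minimal sketch of that kind of check is shown below, assuming each listener judgment is a single comparative score (system minus recording). The function names are hypothetical, and the p-value uses the normal approximation without tie or continuity correction, so it is only a rough guide for small samples; the paper's exact procedure is not given in the abstract.

```python
import math


def cmos(scores):
    """Comparative mean opinion score: the mean of the comparative scores."""
    return sum(scores) / len(scores)


def wilcoxon_signed_rank(scores):
    """Two-sided Wilcoxon signed-rank test on paired differences.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped; ties share average ranks.
    The p-value is a normal approximation (an assumption of this sketch).
    """
    nonzero = [d for d in scores if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    # Assign average ranks to the sorted absolute differences.
    abs_sorted = sorted(abs(d) for d in nonzero)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1..j
        i = j

    w_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return w, p
```

A p-value well above 0.05 from `wilcoxon_signed_rank` means the test finds no significant difference between the two sets of ratings, which is the criterion the abstract uses for human-level quality.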

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Ling et al. 2019

Preprint

Prosodyspeech: Towards Advanced Prosody Model for Neural Text-to-Speech

Pan, Wang et al. 2022

Spoken digit recognition using URAN (universally reconstructable artificial neural-network) VLSI chip

Kim, Han, Lee et al.
