Interspeech 2019
DOI: 10.21437/interspeech.2019-1563

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Abstract: This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech in that it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dependencies among the acoustic features of consecutive frames. For F0 modeling, discretized F0 values are used and th…
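The abstract's core idea, predicting each frame's discretized F0 class conditioned on preceding frames, can be illustrated with a minimal sketch. This is not the paper's DAR architecture: the bin count, context dimension, random "trained" weights, and greedy decoding are all hypothetical stand-ins chosen only to show the autoregressive dependency on the previous frame's output.

```python
import numpy as np

# Illustrative sketch (NOT the paper's model): F0 is discretized into bins,
# and each frame's bin is predicted from the previous frame's bin plus a
# per-frame context vector standing in for linguistic/score features.

rng = np.random.default_rng(0)

N_BINS = 64   # assumed number of discretized F0 classes
CTX_DIM = 8   # hypothetical per-frame conditioning feature size
T = 10        # number of frames to generate

# Random parameters playing the role of trained weights.
W_prev = rng.normal(size=(N_BINS, N_BINS)) * 0.1   # dependence on previous bin
W_ctx = rng.normal(size=(CTX_DIM, N_BINS)) * 0.1   # dependence on context

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_f0_bins(context, init_bin=0):
    """Autoregressively pick one F0 bin per frame (greedy decoding)."""
    bins = []
    prev = init_bin
    for t in range(len(context)):
        logits = W_prev[prev] + context[t] @ W_ctx
        prev = int(np.argmax(softmax(logits)))  # feed prediction back in
        bins.append(prev)
    return bins

ctx = rng.normal(size=(T, CTX_DIM))
print(generate_f0_bins(ctx))  # one discrete F0 class per frame
```

The point of the autoregressive loop is that each frame's distribution depends on the previously emitted class, which is what lets such models capture frame-to-frame dynamics like vibrato better than a frame-independent regressor.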

Cited by 31 publications (26 citation statements)
References 19 publications (27 reference statements)
“…4) Recent Progress on Neural Vocoders: More recently, speaker independent WaveRNN-based neural vocoder [63] became popular as it can generate human-like voices from both in-domain and out-of-domain spectrogram [101]- [103]. Another well-known neural vocoder that achieves high-quality synthesis performance is WaveGlow [64].…”
Section: A. Speech Analysis and Reconstruction (mentioning)
confidence: 99%
“…Non-Seq2Seq singing synthesizers include those based on autoregressive architectures [17,21,22], feed-forward CNN [23], and feed-forward GAN-based approaches [24,25].…”
Section: Relation To Prior Work (mentioning)
confidence: 99%
“…Lyrics-to-singing alignment is important for SVS to decide how long each phoneme is sung in synthesized voices. Previous works [21,42] usually leverage human labeling to split songs into sentences and then conduct phoneme alignment within each sentence by leveraging an HMM (hidden markov model) based speech recognition model. In this paper, we propose a new alignment model to extract the duration of each phoneme, by leveraging raw lyrics and song recordings, without relying on any human labeling efforts.…”
Section: Lyrics-to-Singing Alignment (mentioning)
confidence: 99%