ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053944

Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer

Abstract: We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decod…
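
As a concrete illustration of the duration-model idea in the abstract, the sketch below derives an approximate phoneme-to-frame alignment from the score's note durations. It is a hypothetical reconstruction, not the authors' code: the frame rate and the onset_frac heuristic (a fixed share of each note assigned to leading consonants, with the vowel taking the remainder) are assumed values.

    import numpy as np

    def initial_alignment(note_durations, note_phonemes, frame_rate=100.0,
                          onset_frac=0.2):
        """Approximate alignment: per-frame phoneme indices derived from
        the score's note durations via a simple rule-based duration model."""
        frames = []
        phone_id = 0
        for dur, phones in zip(note_durations, note_phonemes):
            n = max(1, int(round(dur * frame_rate)))  # frames in this note
            if len(phones) > 1:
                # leading consonants share a small fixed fraction of the note
                onset = max(1, int(n * onset_frac) // (len(phones) - 1))
                counts = [onset] * (len(phones) - 1)
                counts.append(max(1, n - sum(counts)))  # vowel gets the rest
            else:
                counts = [n]
            for c in counts:
                frames.extend([phone_id] * c)
                phone_id += 1
        return np.array(frames)

    # Two quarter notes at 120 bpm (0.5 s each), lyrics "sa a"
    print(initial_alignment([0.5, 0.5], [["s", "a"], ["a"]])[:15])

In the paper this rough alignment only seeds the feed-forward decoder; the sketch stops at producing per-frame phoneme labels.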

Cited by 45 publications (49 citation statements)
References 17 publications

“…The RNN encoder employs a bi-directional LSTM. Following [18], the GLU blocks are convolutional modules conditioned on local contexts. The conformer, introduced for automatic speech recognition in [32], combines MHSA with a convolution mechanism.…”
Section: Sequence-to-Sequence SVS
confidence: 99%
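
To make the quoted architecture description concrete, here is a minimal sketch of a GLU convolution block conditioned on local context features, in the spirit of the module attributed to [18]. The channel sizes, kernel width, and residual connection are choices made here for illustration, not the cited authors' implementation.

    import torch
    import torch.nn as nn

    class GLUConvBlock(nn.Module):
        """1-D convolution with a gated linear unit (GLU), conditioned on a
        local context signal c (e.g. frame-level pitch/phoneme features)."""
        def __init__(self, channels, cond_channels, kernel_size=3):
            super().__init__()
            pad = kernel_size // 2
            self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=pad)
            self.cond = nn.Conv1d(cond_channels, 2 * channels, 1)

        def forward(self, x, c):
            # x: (batch, channels, time); c: (batch, cond_channels, time)
            h = self.conv(x) + self.cond(c)
            a, b = h.chunk(2, dim=1)          # split into value and gate
            return x + a * torch.sigmoid(b)   # gated output with residual

    x, c = torch.randn(1, 64, 100), torch.randn(1, 8, 100)
    print(GLUConvBlock(64, 8)(x, c).shape)  # torch.Size([1, 64, 100])
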
“…As there is no official split of the "Kiritan" database, we use 48 songs for training, 1 for validation, and 1 for testing. Following previous works [6,18], we split each song of several minutes of singing into phrases, resulting in 467 phrases for training, 18 for validation, and 10 for testing. The splitting is based on the silence between lyrics.…”
Section: Dataset
confidence: 99%
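
The phrase-splitting step quoted above (cutting songs at silences between lyrics) can be sketched with a simple energy threshold. The threshold, frame sizes, and minimum-gap length below are assumed values, not the settings used in the cited work.

    import numpy as np

    def split_on_silence(wave, sr, frame_ms=25, hop_ms=10,
                         silence_db=-40.0, min_gap_s=0.3):
        """Return (start_sample, end_sample) phrase spans separated by
        silent gaps of at least min_gap_s seconds."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        energy = np.array([
            20 * np.log10(np.sqrt(np.mean(wave[i:i + frame] ** 2)) + 1e-10)
            for i in range(0, max(1, len(wave) - frame), hop)])
        voiced = energy > silence_db
        phrases, start, gap = [], None, 0
        min_gap = int(min_gap_s * sr / hop)
        for i, v in enumerate(voiced):
            if v:
                if start is None:
                    start = i  # phrase begins at the first voiced frame
                gap = 0
            elif start is not None:
                gap += 1
                if gap >= min_gap:  # long silence: close the phrase
                    phrases.append((start * hop, (i - gap) * hop + frame))
                    start, gap = None, 0
        if start is not None:
            phrases.append((start * hop, len(wave)))
        return phrases

    # A 3 s test tone with 0.5 s gaps yields two phrases
    sr = 22050
    t = np.arange(3 * sr) / sr
    wave = np.sin(2 * np.pi * 220 * t) * (t % 1.5 < 1.0)
    print(split_on_silence(wave, sr))
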
“…In singing synthesis, several works aim to reduce the burden of dataset annotation. In particular, sequence-to-sequence models generally avoid the need for detailed phonetic segmentation, but do require a fairly well-aligned musical score with lyrics [2,3,4,5,6,7,8]. Similarly, voice cloning techniques require only a small amount of training data with phonetic segmentation for the target voice (e.g.…”
Section: Relation to Prior Work
confidence: 99%
“…Singing synthesis has recently seen a notable uptick in research activity, and, inspired by modern deep learning techniques developed for text-to-speech (TTS), great strides have been made, e.g. [1,2,3,4,5,6,7,8]. To create a new voice for these models, a supervised approach is generally used, meaning that besides recordings of the target singer, phonetic segmentation or a reasonably well-aligned score with lyrics is needed.…”
Section: Introduction
confidence: 99%
“…Different kinds of models have been utilized and investigated for ML frameworks. These models include neural networks, decision trees, and regression analysis, and have massive applications including speech and object recognition [8][9][10][11][12][13][14][15]. The scope of this paper is focused on neural networks and their subsets, particularly sequence-to-sequence learning.…”
Section: Introduction
confidence: 99%