ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053944

Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer

Abstract: We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decod…
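
As a concrete illustration of the duration-model idea in the abstract, the sketch below derives an approximate phoneme-to-frame alignment from the score's note durations. It is a hypothetical reconstruction, not the authors' code: the frame rate and the onset_frac heuristic (a fixed share of each note assigned to leading consonants, with the vowel taking the remainder) are assumed values.

    import numpy as np

    def initial_alignment(note_durations, note_phonemes, frame_rate=100.0,
                          onset_frac=0.2):
        """Approximate alignment: per-frame phoneme indices derived from
        the score's note durations via a simple rule-based duration model."""
        frames = []
        phone_id = 0
        for dur, phones in zip(note_durations, note_phonemes):
            n = max(1, int(round(dur * frame_rate)))  # frames in this note
            if len(phones) > 1:
                # leading consonants share a small fixed fraction of the note
                onset = max(1, int(n * onset_frac) // (len(phones) - 1))
                counts = [onset] * (len(phones) - 1)
                counts.append(max(1, n - sum(counts)))  # vowel gets the rest
            else:
                counts = [n]
            for c in counts:
                frames.extend([phone_id] * c)
                phone_id += 1
        return np.array(frames)

    # Two quarter notes at 120 bpm (0.5 s each), lyrics "sa a"
    print(initial_alignment([0.5, 0.5], [["s", "a"], ["a"]])[:15])

In the paper this rough alignment only seeds the feed-forward decoder; the sketch stops at producing per-frame phoneme labels.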

Cited by 45 publications (49 citation statements)
References 17 publications

“…The RNN encoder employs a bi-directional LSTM. Following [18], the GLU blocks are convolutional modules conditioned on local contexts. The conformer, introduced for automatic speech recognition in [32], combines MHSA with a convolution mechanism.…”
Section: Sequence-to-Sequence SVS
confidence: 99%
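
To make the quoted architecture description concrete, here is a minimal sketch of a GLU convolution block conditioned on local context features, in the spirit of the module attributed to [18]. The channel sizes, kernel width, and residual connection are choices made here for illustration, not the cited authors' implementation.

    import torch
    import torch.nn as nn

    class GLUConvBlock(nn.Module):
        """1-D convolution with a gated linear unit (GLU), conditioned on a
        local context signal c (e.g. frame-level pitch/phoneme features)."""
        def __init__(self, channels, cond_channels, kernel_size=3):
            super().__init__()
            pad = kernel_size // 2
            self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=pad)
            self.cond = nn.Conv1d(cond_channels, 2 * channels, 1)

        def forward(self, x, c):
            # x: (batch, channels, time); c: (batch, cond_channels, time)
            h = self.conv(x) + self.cond(c)
            a, b = h.chunk(2, dim=1)          # split into value and gate
            return x + a * torch.sigmoid(b)   # gated output with residual

    x, c = torch.randn(1, 64, 100), torch.randn(1, 8, 100)
    print(GLUConvBlock(64, 8)(x, c).shape)  # torch.Size([1, 64, 100])
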
“…As there is no official split of the "Kiritan" database, we use 48 songs for training, 1 for validation, and 1 for testing. Following previous works [6,18], we split each song of several minutes of singing into phrases, resulting in 467 phrases for training, 18 for validation, and 10 for testing. The splitting is based on the silence between lyrics.…”
Section: Dataset
confidence: 99%
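
The phrase-splitting step quoted above (cutting songs at silences between lyrics) can be sketched with a simple energy threshold. The threshold, frame sizes, and minimum-gap length below are assumed values, not the settings used in the cited work.

    import numpy as np

    def split_on_silence(wave, sr, frame_ms=25, hop_ms=10,
                         silence_db=-40.0, min_gap_s=0.3):
        """Return (start_sample, end_sample) phrase spans separated by
        silent gaps of at least min_gap_s seconds."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        energy = np.array([
            20 * np.log10(np.sqrt(np.mean(wave[i:i + frame] ** 2)) + 1e-10)
            for i in range(0, max(1, len(wave) - frame), hop)])
        voiced = energy > silence_db
        phrases, start, gap = [], None, 0
        min_gap = int(min_gap_s * sr / hop)
        for i, v in enumerate(voiced):
            if v:
                if start is None:
                    start = i  # phrase begins at the first voiced frame
                gap = 0
            elif start is not None:
                gap += 1
                if gap >= min_gap:  # long silence: close the phrase
                    phrases.append((start * hop, (i - gap) * hop + frame))
                    start, gap = None, 0
        if start is not None:
            phrases.append((start * hop, len(wave)))
        return phrases

    # A 3 s test tone with 0.5 s gaps yields two phrases
    sr = 22050
    t = np.arange(3 * sr) / sr
    wave = np.sin(2 * np.pi * 220 * t) * (t % 1.5 < 1.0)
    print(split_on_silence(wave, sr))
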
“…In singing synthesis, several works aim to reduce the burden of dataset annotation. In particular, sequence-to-sequence models generally avoid the need for detailed phonetic segmentation, but do require a fairly well-aligned musical score with lyrics [2,3,4,5,6,7,8]. Similarly, voice cloning techniques require only a small amount of training data with phonetic segmentation for the target voice (e.g.…”
Section: Relation to Prior Work
confidence: 99%
“…Singing synthesis has recently seen a notable uptick in research activity, and, inspired by modern deep learning techniques developed for text-to-speech (TTS), great strides have been made, e.g. [1,2,3,4,5,6,7,8]. To create a new voice for these models, a supervised approach is generally used, meaning that besides recordings of the target singer, phonetic segmentation or a reasonably well-aligned score with lyrics is needed.…”
Section: Introduction
confidence: 99%
“…Different kinds of models have been utilized and investigated for ML frameworks. These models include neural networks, decision trees, and regression analysis, and have massive applications including speech and object recognition [8][9][10][11][12][13][14][15]. The scope of this paper is focused on neural networks and their subsets, particularly sequence-to-sequence learning.…”
Section: Introduction
confidence: 99%