Expression Control in Singing Voice Synthesis: Features, approaches, evaluation, and challenges

Umbert, Martí; Bonada, Jordi; Goto, Masataka; Nakano, Tomoyasu; Sundberg, Johan

doi:10.1109/msp.2015.2424572

Cited by 33 publications

(31 citation statements)

References 26 publications


“…We list errors for note onsets, offsets and consonant durations separately to ensure the fitting heuristic affects the results only minimally. • F0 metrics: Standard F0 metrics such as RMSE are given, but it should be noted that these metrics are often not very correlated to perceptual metrics in singing [41]. For instance, starting a vibrato slightly early or late compared to the reference may be equally valid musically, but can cause the two F0 contours to become out of phase, resulting in high distances.…”

mentioning

confidence: 99%

A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs

2017

Self Cite


Abstract: We recently presented a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. Nonetheless, compared to modeling waveform directly, ways of effectively handling higher-dimensional outputs, multiple feature streams and regularization become more important with our approach. In this work, we extend our proposed system to include additional components for predicting F0 and phonetic timings from a musical score with lyrics. These expression-related features are learned together with timbrical features from a single set of natural songs. We compare our method to existing statistical parametric, concatenative, and neural network-based approaches using quantitative metrics as well as listening tests.
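The citation statement for this paper notes that F0 RMSE can be large when a vibrato merely starts slightly early or late. The following is a minimal sketch of that effect, not the authors' evaluation code: the function names, the 100 frames-per-second rate, the 220 Hz note, the 5.5 Hz vibrato rate, and the 50-cent depth are all illustrative assumptions. It builds two otherwise identical contours whose vibrato onsets differ by about half a vibrato period and measures their RMSE in cents.

```python
import math


def vibrato_contour(n_frames, frame_rate_hz, base_hz, rate_hz, depth_cents, onset_s):
    """Return an F0 contour in Hz: a flat note with sinusoidal vibrato from onset_s on.

    All parameters are hypothetical values chosen for illustration.
    """
    contour = []
    for i in range(n_frames):
        t = i / frame_rate_hz
        cents = 0.0
        if t >= onset_s:
            cents = depth_cents * math.sin(2 * math.pi * rate_hz * (t - onset_s))
        contour.append(base_hz * 2 ** (cents / 1200))
    return contour


def f0_rmse_cents(reference, estimate):
    """Root-mean-square F0 error in cents between two equal-length contours."""
    squared = [(1200 * math.log2(e / r)) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


# Same note, same vibrato rate and depth; only the vibrato onset differs.
reference = vibrato_contour(200, 100.0, 220.0, 5.5, 50.0, onset_s=0.50)
same_onset = vibrato_contour(200, 100.0, 220.0, 5.5, 50.0, onset_s=0.50)
late_onset = vibrato_contour(200, 100.0, 220.0, 5.5, 50.0, onset_s=0.59)

print("RMSE, identical onset (cents):", f0_rmse_cents(reference, same_onset))
print("RMSE, onset 90 ms late (cents):", f0_rmse_cents(reference, late_onset))
```

With these assumed values the identical contour gives zero error, while the late-onset contour, which a listener might judge equally valid, gives an error comparable to the vibrato depth itself because the two oscillations are nearly in antiphase.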


“…Expression control in singing synthesis, also known as performance modelling, consists in the manipulation of a set of voice features (e.g., phonetic timing, pitch contour, vibrato, timbre) that relates to a particular emotion, style, or singer [41]. Accordingly, the expression control generation module provides the duration, F0, and spectral controls required by the transformation module to convert the sequence of speech parameters into singing parameters.…”

Section: Expression Control Generation

mentioning

confidence: 99%

A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept

Freixes, Alías, Socoró

2019

J AUDIO SPEECH MUSIC PROC.


Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for eventual singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, timescale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of vocaloid, the singing scores of around 60 which were obtained validate that the framework could reasonably address eventual singing needs.
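To make the pitch-scale and timescale factors mentioned in the abstract concrete, here is a rough sketch under the common assumption that a pitch-scale factor is the ratio of target to source F0 and a timescale factor is the ratio of target to source duration. The paper's exact definitions may differ; the helper names and the example values (a 110 Hz, 80 ms spoken vowel mapped to a 0.5 s note at MIDI pitch 57) are hypothetical.

```python
def midi_to_hz(midi_note: float) -> float:
    """Convert a MIDI note number to Hz (equal temperament, A4 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def sts_factors(source_f0_hz, source_dur_s, target_midi, target_dur_s):
    """Return (pitch_scale, time_scale) needed to map a spoken unit onto a score note.

    Assumed definitions, for illustration only:
      pitch_scale = target F0 / source F0
      time_scale  = target duration / source duration
    """
    pitch_scale = midi_to_hz(target_midi) / source_f0_hz
    time_scale = target_dur_s / source_dur_s
    return pitch_scale, time_scale


# Hypothetical spoken vowel: 110 Hz, 80 ms. Hypothetical score note: MIDI 57 (220 Hz), 0.5 s.
pitch_scale, time_scale = sts_factors(110.0, 0.08, 57, 0.5)
print(f"pitch-scale factor: {pitch_scale:.2f}, timescale factor: {time_scale:.2f}")
```

Under these assumptions a short spoken vowel must be stretched several times over to fill even a moderately long note, which is consistent with the abstract's remark that timescale factors stay high because spoken vowels are short.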


“…25. Note that as in natural voices, the vowel identity tends to disappear for high pitch, with all vowels becoming close to each other [sound example in Additional files 7 and 8].…”

Section: First Formant Tuning

mentioning

confidence: 99%

“…The main advantage of parametric synthesis is its flexibility and economy in terms of memory and computational load. The next generation of voice synthesis systems was based on recording, concatenation, and modification of real voice samples or statistical parametric synthesis [9]. A formant synthesizer is preferred for Cantor Digitalis because flexibility and real time are the main issues for performative singing synthesis.…”

Section: Introduction

mentioning

confidence: 99%


Cantor Digitalis: chironomic parametric synthesis of singing

Feugère, d’Alessandro, Doval, et al.

2017

J AUDIO SPEECH MUSIC PROC.


Cantor Digitalis is a performative singing synthesizer that is composed of two main parts: a chironomic control interface and a parametric voice synthesizer. The control interface is based on a pen/touch graphic tablet equipped with a template representing vocalic and melodic spaces. Hand and pen positions, pen pressure, and a graphical user interface are assigned to specific vocal controls. This interface allows for real-time accurate control over high-level singing synthesis parameters. The sound generation system is based on a parametric synthesizer that features a spectral voice source model, a vocal tract model consisting of parallel filters for vocalic formants and cascaded with anti-resonance for the spectral effect of hypo-pharynx cavities, and rules for parameter settings and source/filter dependencies between fundamental frequency, vocal effort, and formants. Because Cantor Digitalis is a parametric system, every aspect of voice quality can be controlled (e.g., vocal tract size, aperiodicities in the voice source, vowels, and so forth). It offers several presets for different voice types. Cantor Digitalis has been played on stage in several public concerts, and it has also been proven to be useful as a tool for voice pedagogy. The aim of this article is to provide a comprehensive technical overview of Cantor Digitalis.
