Martí Umbert scite author profile

Martí Umbert

4 Publications

82 Citation Statements Received

87 Citation Statements Given

Affiliations

Pompeu Fabra University, Telefonica Research and Development

Publications

Expressive Singing Synthesis Based on Unit Selection for the Singing Synthesis Challenge 2016

Bonada

Umbert

Blaauw

2016

Sample and statistically based singing synthesizers typically require a large amount of data to automatically generate expressive synthetic performances. In this paper we present a singing synthesizer that, using two rather small databases, is able to generate expressive synthesis from an input consisting of notes and lyrics. The system is based on unit selection and uses the Wide-Band Harmonic Sinusoidal Model for transforming samples. The first database focuses on expression and consists of less than 2 minutes of free expressive singing using solely vowels. The second one is the timbre database, which for the English case consists of roughly 35 minutes of monotonic singing of a set of sentences, one syllable per beat. The synthesis is divided into two steps. First, an expressive vowel singing performance of the target song is generated using the expression database. Next, this performance is used as input control of the synthesis using the timbre database and the target lyrics. A selection of synthetic performances has been submitted to the Interspeech Singing Synthesis Challenge 2016, in which they are compared to other competing systems.

Expression Control in Singing Voice Synthesis: Features, approaches, evaluation, and challenges

Umbert

Bonada

Goto

et al. 2015

IEEE Signal Process. Mag.

In the context of singing voice synthesis, expression control manipulates a set of voice features related to a particular emotion, style, or singer. Also known as performance modeling, it has been approached from different perspectives and for different purposes, and different projects have shown a wide range of applicability. The aim of this article is to provide an overview of approaches to expression control in singing voice synthesis. Section I introduces some musical applications that use singing voice synthesis techniques to justify the need for an accurate control of expression. Then, expression is defined and related to speech and instrument performance modeling. Next, Section II presents the commonly studied set of voice parameters that can change perceptual aspects of synthesized voices. Section III provides, as the main topic of this review, an up-to-date classification, comparison, and description of a selection of approaches to expression control. Then, Section IV describes how these approaches are currently evaluated and discusses the benefits of building a common evaluation framework and adopting perceptually motivated objective measures. Finally, Section V discusses the challenges that we currently foresee.

Automatic Speech Feature Learning for Continuous Prediction of Customer Satisfaction in Contact Center Phone Calls

Segura

Balcells

Umbert

et al. 2016

Speech-related processing tasks have commonly been tackled using engineered features, also known as hand-crafted descriptors. These features have usually been optimized over the years by the research community, which constantly seeks the most meaningful, robust, and compact audio representations for the specific domain or task. In recent years, great interest has arisen in developing architectures that are able to learn such features by themselves, thus bypassing the required engineering effort. In this work we explore the possibility of using Convolutional Neural Networks (CNN) directly on raw audio signals to automatically learn meaningful features. Additionally, we study how well the learned features generalize to a different task. First, a CNN-based continuous conflict detector is trained on audio extracted from televised political debates in French. Then, while keeping the previously learned features, we adapt the last layers of the network to target another concept using completely unrelated data. Concretely, we predict self-reported customer satisfaction from call center conversations in Spanish. Reported results show that our proposed approach, using raw audio, obtains results similar to those of a CNN using classical Mel-scale filter banks. In addition, the learning transfer from the conflict detection task to satisfaction prediction shows a successful generalization of the features learned by the deep architecture.

The Role of Linguistic and Prosodic Cues on the Prediction of Self-Reported Satisfaction in Contact Centre Phone Calls

Luque

Segura

Sánchez

et al. 2017

Call centre data is typically collected by organizations and corporations in order to ensure the quality of service, supporting, for example, mining capabilities for monitoring customer satisfaction. In this work, we analyze the significance of various acoustic features extracted from customer-agent spoken interactions in predicting self-reported satisfaction by the customer. We also investigate whether speech prosodic features can deliver complementary information to speech transcriptions provided by an ASR. We explore the possibility of using a deep neural architecture to perform early feature fusion on both prosodic and linguistic information. Convolutional Neural Networks are trained on a combination of word embeddings and acoustic features for the binary classification task of "low" and "high" satisfaction prediction. We conducted our experiments analysing real call centre interactions of a large corporation in a Spanish-speaking country. Our experiments show that linguistic features can predict self-reported satisfaction more accurately than those based on prosodic and conversational descriptors. We also find that dialog turn-level conversational features generally outperform frame-level signal descriptors. Finally, the fusion of linguistic and prosodic features reports the best performance in our experiments, suggesting the complementarity of the information conveyed by each set of behavioral representations.
