Interspeech 2021
DOI: 10.21437/interspeech.2021-189

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Abstract: Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segm…

Cited by 3 publications (3 citation statements). References 39 publications (91 reference statements).

“…In order for robots to sound emotionally expressive, as noted by the older adults in our study, "emotional voice conversion" (i.e., changing the emotion of the utterance) can be applied in text-to-speech (TTS) synthesis that allows variability in vocal intonation (see (Zhou et al, 2022) for a recent review). Recent methods have also incorporated LLMs into speech synthesis with emotional adaptation (Kang et al, 2023; Leng et al, 2023). Furthermore, Voicebox (Le et al, 2023) […] Mimicking user expressions and behaviors, such as smiling and laughing with the user, can improve interpersonal coordination, boost interaction smoothness, and increase the likeability of the robot (Vicaria and Dickens, 2016).…”
Section: Reflection of Congruent Emotions (mentioning)
confidence: 99%
“…In this study, we used speech emotion recognition in two steps [25]. First, emotion embeddings were utilized to generate emotion information for typical utterances.…”
Section: Speech Emotion Recognition (SER) (mentioning)
confidence: 99%
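
The two-step use of speech emotion recognition described in that snippet can be illustrated roughly as follows. This is a minimal sketch, not the cited paper's implementation: the SEREncoder class, the tensor shapes, and the conditioning-by-concatenation step are all assumptions made for illustration.

```python
# Minimal sketch (hypothetical): two-step use of speech emotion recognition --
# first derive an utterance-level emotion embedding with an SER encoder, then
# hand that embedding to a TTS acoustic model as conditioning information.
import torch
import torch.nn as nn

class SEREncoder(nn.Module):
    """Stand-in for a pretrained speech-emotion-recognition encoder."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> utterance-level emotion embedding
        _, h = self.rnn(mel)
        return h[-1]                          # (batch, emb_dim)

# Step 1: extract an emotion embedding from a reference utterance.
encoder = SEREncoder()
reference_mel = torch.randn(1, 200, 80)       # placeholder mel-spectrogram
emotion_emb = encoder(reference_mel)

# Step 2: condition a (hypothetical) TTS acoustic model on the embedding,
# e.g. by concatenating it to every frame of the text-encoder output.
text_hidden = torch.randn(1, 50, 256)         # placeholder text-encoder output
conditioned = torch.cat(
    [text_hidden, emotion_emb.unsqueeze(1).expand(-1, text_hidden.size(1), -1)],
    dim=-1,
)
print(conditioned.shape)                      # torch.Size([1, 50, 512])
```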
“…The diffusion model-based emotion synthesis model proposed in this paper is divided into two styles [25].…”
Section: Diffusion Models with Mel-Spectrograms (mentioning)
confidence: 99%
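
For the diffusion-model-based emotion synthesis mentioned in the last snippet, the sketch below shows one conditional denoising-diffusion training step over mel-spectrograms. It illustrates the general technique only; the DenoiserCond module, the noise schedule, and the emotion-conditioning pathway are assumptions, and the cited paper's two "styles" are not reconstructed here.

```python
# Minimal sketch (hypothetical): one training step of a denoising diffusion
# model over mel-spectrograms, conditioned on an emotion embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoiserCond(nn.Module):
    """Tiny conditional denoiser that predicts the injected noise."""
    def __init__(self, n_mels: int = 80, emo_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels + emo_dim + 1, n_mels)

    def forward(self, noisy_mel, t, emo):
        # noisy_mel: (B, T, n_mels); t: (B, 1); emo: (B, emo_dim)
        B, T, _ = noisy_mel.shape
        cond = torch.cat([emo, t], dim=-1).unsqueeze(1).expand(B, T, -1)
        return self.proj(torch.cat([noisy_mel, cond], dim=-1))

denoiser = DenoiserCond()
mel = torch.randn(4, 120, 80)        # clean mel-spectrograms (placeholder)
emo = torch.randn(4, 256)            # emotion embeddings (placeholder)
t = torch.rand(4, 1)                 # diffusion time in [0, 1]

# Forward process: mix the clean mel with Gaussian noise according to t.
noise = torch.randn_like(mel)
alpha = (1.0 - t).view(-1, 1, 1)
noisy = alpha.sqrt() * mel + (1.0 - alpha).sqrt() * noise

# Denoising objective: predict the injected noise given the conditioning.
loss = F.mse_loss(denoiser(noisy, t, emo), noise)
loss.backward()
```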