Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis

Siegert, Ingo; Lotz, Alicia Flores; Egorow, Olga; Wendemuth, Andreas

doi:10.1007/978-3-319-66429-3_44

Cited by 7 publications

(4 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We suppose that this outcome was caused by the underlying codec. It has been already shown for the hybrid operation mode of OPUS that certain emotions can be recognized better (Siegert et al, 2017), which is in some sense comparable to gender differences in the pronunciation of charisma. Furthermore, for men, only the differences between SPEEX and all other codecs are significant (according to Wilcoxon-Wilcox post-hoc tests with Bonferroni correction of alpha-error levels), whereas for women, also the comparisons of WAV vs. MP3 and WAV vs. OPUS came out significantly.…”

Section: Evaluation and Resultsmentioning

confidence: 77%

Case Report: Women, Be Aware that Your Vocal Charisma can Dwindle in Remote Meetings

Siegert

Niebuhr

2021

Front. Commun.

Self Cite

View full text Add to dashboard Cite

Remote meetings via Zoom, Skype, or Teams limit the range and richness of nonverbal communication signals. Not just because of the typically sub-optimal light, posture, and gaze conditions, but also because of the reduced speaker visibility. Consequently, the speaker’s voice becomes immensely important, especially when it comes to being persuasive and conveying charismatic attributes. However, to offer a reliable service and limit the transmission bandwidth, remote meeting tools heavily rely on signal compression. It has never been analyzed how this compression affects a speaker’s persuasive and overall charismatic impact. Our study addresses this gap for the audio signal. A perception experiment was carried out in which listeners rated short stimulus utterances with systematically varied compression rates and techniques. The scalar ratings concerned a set of charismatic speaker attributes. Results show that the applied audio compression significantly influences the assessment of a speaker’s charismatic impact and that, particularly female speakers seem to be systematically disadvantaged by audio compression rates and techniques. Their charismatic impact decreases over a larger range of different codecs; and this decrease is additionally also more strongly pronounced than for male speakers. We discuss these findings with respect to two possible explanations. The first explanation is signal-based: audio compression codecs could be generally optimized for male speech and, thus, degrade female speech more (particularly in terms of charisma-associated features). Alternatively, the explanation is in the ears of the listeners who are less forgiving of signal degradation when rating female speakers’ charisma.

show abstract

Section: Evaluation and Resultsmentioning

confidence: 77%

Case Report: Women, Be Aware that Your Vocal Charisma can Dwindle in Remote Meetings

Siegert

Niebuhr

2021

Front. Commun.

Self Cite

View full text Add to dashboard Cite

show abstract

“…The workflow followed from working with the raw data until obtaining the trained model is represented in Figure 1. [23], which are the most important features for audio management [24]. We have decided to use 22.5 kHz, a bit-depth of 16 bits and only one audio channel corresponding to monaural sound re-production.…”

Section: Methodsmentioning

confidence: 99%

Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression

Nogales,

Cayuela,

García-Tejedor

2023

Preprint

View full text Add to dashboard Cite

This paper presents an approach to enhancing the clarity and intelligibility of speech in digital communications compromised by various background noises. Utilizing deep learning techniques, specifically a Variational Autoencoder (VAE) with 2D convolutional filters, we aim to suppress background noise in audio signals. Our method focuses on four simulated environmental noise scenarios: storms, wind, traffic, and aircraft. Training dataset has been obtained from public sources (TED-LIUM 3 dataset, which includes audio recordings from the popular TED-TALK series) combining with these background noises. The audio signals were transformed into 2D power spectrograms, upon which our VAE model was trained to filter out the noise and reconstruct clean audio. Our results demonstrate that the model outperforms existing state-of-the-art solutions in noise sup-pression. Although differences in noise types were observed, it was challenging to definitively conclude which background noise most adversely affects speech quality. Results have been assessed with objective methods (mathematical metrics) and subjective (listening to a set of audios by humans). Notably, wind noise showed the smallest deviation between the noisy and cleaned audio, perceived subjectively as the most improved scenario. Future work involves refining the phase calculation of the cleaned audio and creating a more balanced dataset to minimize differences in audio quality across scenarios. Additionally, practical ap-plications of the model in real-time streaming audio are envisaged. This research contributes significantly to the field of audio signal processing by offering a deep learning solution tailored to various noise conditions, enhancing digital communication quality.

show abstract

“…Although a number of studies investigated the general impact of codec compression on spectral quality and acoustic features (Byrne and Foulkes, 2004;Guillemin and Watson, 2009;Siegert et al, 2016), the effects on the preservation of emotions, nonverbally conveyed ones in particular, have rarely been addressed (Albahri et al, 2016;Jokisch et al, 2016). Especially, the preservation of nonverbal emotional cues under low bandwidths is underresearched.…”

Section: Codecs and Their Influence On Prosodymentioning

confidence: 99%

“…The main purpose of applying speech compression for mobile communication is to reduce the bandwidth for transmission, the transmission delay as well as the required system memory and storage (Maruschke et al, 2016;Siegert et al, 2016). Several codecs have been developed to meet various applications with different quality requirements, aiming to retain the speech intelligibility (ITU-T, 1996(ITU-T, , 2014Maruschke et al, 2016).…”

Section: Utilized Audio Codecsmentioning

confidence: 99%

A digital “flat affect”? Popular speech compression codecs and their effects on emotional prosody

Niebuhr

Siegert²

2023

Front. Commun.

View full text Add to dashboard Cite

IntroductionCalls via video apps, mobile phones and similar digital channels are a rapidly growing form of speech communication. Such calls are not only— and perhaps less and less— about exchanging content, but about creating, maintaining, and expanding social and business networks. In the phonetic code of speech, these social and emotional signals are considerably shaped by (or encoded in) prosody. However, according to previous studies, it is precisely this prosody that is significantly distorted by modern compression codecs. As a result, the identification of emotions becomes blurred and can even be lost to the extent that opposing emotions like joy and anger or disgust and sadness are no longer differentiated on the recipients' side. The present study searches for the acoustic origins of these perceptual findings.MethodA set of 108 sentences from the Berlin Database of Emotional Speech served as speech material in our study. The sentences were realized by professional actors (2m, 2f) with seven different emotions (neutral, fear, disgust, joy, boredom, anger, sadness) and acoustically analyzed in the original uncompressed (WAV) version and as well as in strongly compressed versions based on the four popular codecs AMR-WB, MP3, OPUS, and SPEEX. The analysis included 6 tonal (i.e. f0-related) and 7 non-tonal prosodic parameters (e.g., formants as well as acoustic-energy and spectral-slope estimates).ResultsResults show significant, codec-specific distortion effects on all 13 prosodic parameter measurements compared to the WAV reference condition. Means values of automatic measurement can, across sentences, deviate by up to 20% from the values of the WAV reference condition. Moreover, the effects go in opposite directions for tonal and non-tonal parameters. While tonal parameters are distorted by speech compression such that the acoustic differences between emotions are increased, compressing non-tonal parameters make the acoustic-prosodic profiles of emotions more similar to each other, particularly under MP3 and SPEEX compression.DiscussionThe term “flat affect” comes from the medical field and describes a person's inability to express or display emotions. So, does strong compression of emotional speech create a “digital flat affect”? The answer to this question is a conditional “yes”. We provided clear evidence for a “digital flat affect”. However, it seems less strongly pronounced in the present acoustic measurements than in previous perception data, and it manifests itself more strongly in non-tonal than in tonal parameters. We discuss the practical implications of our findings for the everyday use of digital communication devices and critically reflect on the generalizability of our findings, also with respect to their origins in the codecs' inner mechanics.

show abstract

Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis

Cited by 7 publications

References 17 publications

Case Report: Women, Be Aware that Your Vocal Charisma can Dwindle in Remote Meetings

Case Report: Women, Be Aware that Your Vocal Charisma can Dwindle in Remote Meetings

Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression

A digital “flat affect”? Popular speech compression codecs and their effects on emotional prosody

Contact Info

Product

Resources

About