ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054317
X-Vectors Meet Emotions: A Study On Dependencies Between Emotion and Speaker Recognition

Abstract: In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models such as the x-vector model. Then, we improve emotion recognition performance by f…
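The transfer-learning setup sketched in the abstract uses a pre-trained speaker model as a fixed feature extractor and trains only a simple linear classifier for emotion. Below is a minimal sketch of that idea under assumed inputs: the randomly generated arrays stand in for pre-extracted 512-d x-vectors and emotion labels, and are not data from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for pre-extracted x-vectors (one 512-d embedding
# per utterance) and their emotion labels; in practice these would come from a
# pre-trained speaker model run over the emotion corpus.
rng = np.random.default_rng(0)
xvectors = rng.standard_normal((1000, 512))
emotion_labels = rng.integers(0, 4, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    xvectors, emotion_labels, test_size=0.2, stratify=emotion_labels, random_state=0
)

# The transfer step: the speaker network stays frozen; only this linear
# classifier is trained on its embeddings for the emotion task.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("emotion accuracy:", clf.score(X_test, y_test))
```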


Citation classifications: 3 supporting, 58 mentioning, 0 contrasting
Cited by 83 publications (61 citation statements)
References 33 publications
“…As mentioned before, x-vectors are DNN speaker embeddings that have seen a growing use in speaker recognition and paralinguistic tasks [16]. While i-vectors represent the total variability subspace of a channel or speaker, x-vectors aim to represent discriminative features between speakers.…”
Section: X-vector Extraction (mentioning)
confidence: 99%
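The excerpt above contrasts i-vectors with x-vector embeddings. As a rough illustration of what an x-vector extractor looks like, here is a simplified sketch: TDNN (dilated 1-D convolution) frame-level layers over filterbank frames, statistics pooling across time, and a segment-level layer whose output serves as the embedding. Layer sizes are assumptions for illustration, not the exact configuration of the cited system.

```python
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self, feat_dim=40, embed_dim=512, n_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers (dilated 1-D convolutions over time).
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Statistics pooling concatenates mean and std, doubling the dimension.
        self.segment = nn.Linear(2 * 1500, embed_dim)
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, feats):                            # feats: (batch, frames, feat_dim)
        h = self.frame_layers(feats.transpose(1, 2))     # (batch, 1500, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embedding = self.segment(stats)                  # the "x-vector"
        return embedding, self.classifier(torch.relu(embedding))

# Toy usage: two utterances of 200 frames of 40-d features.
emb, logits = XVectorSketch()(torch.randn(2, 200, 40))
```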
“…Proposed by Snyder, x-vectors [13] are discriminative DNN speaker embeddings that have outperformed i-vectors in tasks such as speaker and language recognition [14,15]. Recent advances suggest that x-vectors have been successfully applied to paralinguistic tasks such as emotion recognition [16], and to the detection of diseases like Obstructive Sleep Apnea [17] and Alzheimer's [18]. Following the line of research present in [11] and [12], we investigate the reliability of using x-vector speaker embeddings as features for automatic intelligibility prediction in the context of HNC.…”
Section: Introduction (mentioning)
confidence: 99%
“…In this paper, 40-dimensional (-d) FBKs with a 10 ms frame duration and 25 ms frame length are used, which is denoted FBK25. FBK features have information about the short-term spectrum but do not contain pitch information that can be important in describing emotional speech [20] and is often complementary to FBKs [21,22]. The log pitch frequency features with probability-of-voicing-weighted mean subtraction over a 1.5 second window are used along with FBKs [23].…”
Section: Audio Features (mentioning)
confidence: 99%
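The feature recipe quoted above (40-d log filterbanks with a 25 ms window and 10 ms shift, plus log pitch with probability-of-voicing-weighted mean subtraction over a roughly 1.5 second window) could be approximated as in the sketch below. This uses librosa rather than the cited paper's Kaldi pipeline; the pitch range, smoothing details, and parameter values are assumptions for illustration.

```python
import numpy as np
import librosa

def fbk25_with_pitch(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    win, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms window, 10 ms shift

    # 40-dimensional log mel filterbank features ("FBK25").
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, win_length=win,
                                         hop_length=hop, n_mels=40)
    fbk = np.log(mel + 1e-10).T                                   # (frames, 40)

    # Pitch track plus voicing probability on the same frame grid.
    f0, _, pov = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                              frame_length=win * 4, hop_length=hop)
    log_f0 = np.log(np.nan_to_num(f0, nan=1.0))                   # 0 for unvoiced frames

    # Probability-of-voicing-weighted running mean over ~1.5 s (150 frames).
    half = 75
    norm_f0 = np.empty_like(log_f0)
    for t in range(len(log_f0)):
        lo, hi = max(0, t - half), min(len(log_f0), t + half)
        w = pov[lo:hi] + 1e-6
        norm_f0[t] = log_f0[t] - np.sum(w * log_f0[lo:hi]) / np.sum(w)

    # Append the normalized log pitch to the filterbanks.
    n = min(len(fbk), len(norm_f0))
    return np.hstack([fbk[:n], norm_f0[:n, None]])                # (frames, 41)
```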
“…Although DNNs are already outperforming traditional approaches [21], that is not true for all tasks and data sets [22]. This has led the community to adopt transfer learning approaches, starting from feature-based [23] and recently moving to DL approaches [24,25,26]. Hence, understanding how transfer learning works could lead to the design of more powerful algorithms that unlock the full potential of DL for SER, and other low-resource audio tasks.…”
Section: Introduction (mentioning)
confidence: 99%
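The excerpt above distinguishes feature-based transfer from deep-learning transfer for SER. A minimal sketch of the latter is: reuse a network trained for speaker recognition, attach a new emotion head, and fine-tune end to end with a smaller learning rate on the reused layers. The backbone, checkpoint path, and layer sizes below are hypothetical stand-ins, not the cited papers' models.

```python
import torch
import torch.nn as nn

# Stand-in backbone; in practice this would be the pre-trained speaker network.
backbone = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 512))
# backbone.load_state_dict(torch.load("speaker_model.pt"))  # hypothetical checkpoint
emotion_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 4))

# Deep transfer: fine-tune everything, but update the reused layers gently.
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": emotion_head.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()

def train_step(feats, labels):
    logits = emotion_head(backbone(feats))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random features and labels.
print(train_step(torch.randn(8, 40), torch.randint(0, 4, (8,))))
```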