2021
DOI: 10.48550/arxiv.2110.06650
Preprint

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Abstract: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or m…
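The abstract describes conditioning one or more layers of a DNN on a summary BERT embedding. As a rough illustration of what such multistage conditioning could look like (not the authors' exact architecture, which the truncated abstract does not specify), here is a minimal PyTorch sketch assuming a FiLM-style affine modulation of each convolutional block by the linguistic embedding; all module names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ConditionedConvBlock(nn.Module):
    """1-D conv block whose activations are modulated by a text embedding.

    FiLM-style affine conditioning is one plausible reading of
    "conditioning convolutional layers on linguistic embeddings";
    the paper's exact mechanism may differ.
    """
    def __init__(self, in_ch, out_ch, text_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.scale = nn.Linear(text_dim, out_ch)   # per-channel gain
        self.shift = nn.Linear(text_dim, out_ch)   # per-channel bias

    def forward(self, x, text_emb):
        # x: (batch, channels, time); text_emb: (batch, text_dim)
        h = torch.relu(self.conv(x))
        gamma = self.scale(text_emb).unsqueeze(-1)  # (batch, out_ch, 1)
        beta = self.shift(text_emb).unsqueeze(-1)
        return gamma * h + beta                     # one fusion point per block

class MultistageFusionSER(nn.Module):
    """Stacks several conditioned blocks, so the BERT summary embedding
    is injected at multiple depths rather than at a single merge point."""
    def __init__(self, n_mels=64, text_dim=768, n_outputs=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            ConditionedConvBlock(n_mels, 128, text_dim),
            ConditionedConvBlock(128, 128, text_dim),
            ConditionedConvBlock(128, 256, text_dim),
        ])
        self.head = nn.Linear(256, n_outputs)  # e.g. arousal/valence/dominance

    def forward(self, spec, text_emb):
        h = spec
        for block in self.blocks:
            h = block(h, text_emb)
        return self.head(h.mean(dim=-1))  # average over time, then predict
```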

Cited by 3 publications (14 citation statements); references 73 publications.
“…The numbers back up two of our earlier findings: the large architecture is superior to the base model, and HuBERT outperforms wav2vec 2.0. Their CCC performance surpasses both that of Triantafyllopoulos et al [4] (.515), who proposed a multimodal fusion of pre-trained BERT embeddings with an untrained CNN model, and that of Li et al [35] (.377), who pre-trained a CRNN model on LibriSpeech using Contrastive Predictive Coding and subsequently fine-tuned it on MSP-Podcast.…”
Citation type: mentioning (confidence: 90%)
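The scores compared here are concordance correlation coefficients (CCC), the standard metric and loss for dimensional SER. For reference, a minimal NumPy implementation of the standard CCC formula (not taken from either paper's codebase):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient (Lin, 1989).

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)

    Reaches 1 only when predictions match targets in correlation,
    scale, and location, which is why it is preferred over plain
    Pearson correlation for dimensional emotion targets.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

# Example: perfect rank order but wrong scale still loses CCC.
y_true = np.array([0.1, 0.4, 0.5, 0.9])
print(ccc(y_true, y_true))        # 1.0
print(ccc(y_true, 2 * y_true))    # < 1.0 despite Pearson r = 1
```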
“…Inspired by Wang et al [23], we use a simple head architecture, which we build on top of wav2vec 2.0 [21] or HuBERT [22] (see Figure 1): we apply average pooling over the hidden states of the last transformer layer and feed the result through a hidden layer and a final output layer (the pooled embeddings and the hidden-layer outputs are dropped out). For fine-tuning on the downstream task, we use the ADAM optimiser with CCC loss, which is the standard loss function used for dimensional SER [4,35,43], and a fixed learning rate of 1e−4. We run for 5 epochs with a batch size of 32 and keep the checkpoint with the best performance on the development set.…”
Section: Architecture. Citation type: mentioning (confidence: 99%)
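The quoted recipe is concrete enough to sketch. Below is a hedged PyTorch reconstruction of such a head on top of a Hugging Face wav2vec 2.0 / HuBERT encoder; the checkpoint name, hidden size, dropout rate, and activation are assumptions, not the citing authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # Hugging Face transformers

class CCCLoss(nn.Module):
    """1 - CCC, the loss the quote refers to for dimensional SER."""
    def forward(self, pred, target):
        pm, tm = pred.mean(), target.mean()
        cov = ((pred - pm) * (target - tm)).mean()
        ccc = 2 * cov / (pred.var(unbiased=False) + target.var(unbiased=False)
                         + (pm - tm) ** 2)
        return 1 - ccc

class PooledHead(nn.Module):
    """Average-pool the last transformer layer, then a hidden and an output
    layer, with dropout on the pooled embedding and hidden activations."""
    def __init__(self, encoder_name="facebook/wav2vec2-base", n_outputs=1,
                 hidden=256, p_drop=0.5):  # hidden size/dropout are guesses
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.drop = nn.Dropout(p_drop)
        self.hidden = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, n_outputs)

    def forward(self, input_values):
        states = self.encoder(input_values).last_hidden_state  # (B, T, D)
        pooled = self.drop(states.mean(dim=1))                 # average pooling
        h = self.drop(torch.tanh(self.hidden(pooled)))         # activation assumed
        return self.out(h).squeeze(-1)

# Fine-tuning setup as quoted: Adam optimiser, fixed lr of 1e-4, CCC loss;
# the quote further specifies 5 epochs and batch size 32 (loop omitted).
model = PooledHead()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = CCCLoss()
```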