Interspeech 2022
DOI: 10.21437/interspeech.2022-10825

Overlapped speech and gender detection with WavLM pre-trained features

Abstract: This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media (Gender Equality Monitoring project). In this application context, we need to automatically segment the speech signal according to speaker gender and to identify when at least two speakers speak at the same time. We propose to use the WavLM model, which has the advantage of being pre-trained on a huge amount of speech data, to build an overlapped speech detection (OSD) an…

Cited by 7 publications (7 citation statements)
References 22 publications (30 reference statements)
“…The proxy models are expected to reach similar, eventually slightly lower, performances as the teacher. The score obtained on OSD by the teacher model compares with the state of the art which is 63.4% on DiHard III [14].…”
Section: Segmentation Performance (citation type: mentioning)
confidence: 80%
“…The teacher model is similar to [14] and is composed of two main parts: feature extraction and sequence modeling. The former is performed using pre-trained Wavlm Large [22] that outputs a sequence of 1024-dimension vectors.…”
Section: Model Architectures (citation type: mentioning)
confidence: 99%
“…Since overlapping speech is a rare event, the classes are unbalanced [57]. The class balance can be improved by artificially generating additional overlapped data by combining single-speaker utterances from other datasets [18,39] or random segments of the training data at training time [8].…”
Section: B Labelling Procedures (citation type: mentioning)
confidence: 99%
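
The augmentation described in the last citation statement (combining single-speaker utterances to create extra overlapped data) can be illustrated with a minimal sketch. This is not the procedure of any cited paper: the function names `mix_with_overlap` and `random_overlap_example` are hypothetical, utterances are assumed to be plain lists of float samples at a common sampling rate, and each utterance is assumed to contain speech from its first to its last sample (no internal silence), so the overlap label is simply the region where both signals are present.

```python
import random


def mix_with_overlap(utt_a, utt_b, offset):
    """Sum two mono utterances, with utt_b starting `offset` samples after utt_a.

    Returns (mixture, labels), where labels[i] is 1 when both speakers are
    active at sample i and 0 otherwise. Assumption: every sample of each
    utterance counts as speech.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    length = max(len(utt_a), offset + len(utt_b))
    mixture = [0.0] * length
    labels = [0] * length
    for i, sample in enumerate(utt_a):
        mixture[i] += sample
    for j, sample in enumerate(utt_b):
        mixture[offset + j] += sample
    # The overlapped region is where both utterances are present.
    overlap_end = min(len(utt_a), offset + len(utt_b))
    for i in range(offset, overlap_end):
        labels[i] = 1
    return mixture, labels


def random_overlap_example(utterances, rng):
    """Pick two distinct utterances and a random start offset inside the first."""
    utt_a, utt_b = rng.sample(utterances, 2)
    offset = rng.randrange(len(utt_a))
    return mix_with_overlap(utt_a, utt_b, offset)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [[0.1] * 8, [0.2] * 5, [0.3] * 6]  # toy stand-ins for real waveforms
    mixture, labels = random_overlap_example(pool, rng)
    print(len(mixture), sum(labels))
```

Because the offset is drawn inside the first utterance, every generated example contains at least one overlapped sample, which is what raises the share of the rare positive class. A real pipeline would typically also rescale the two signals to a chosen energy ratio and derive labels from voice-activity annotations rather than from utterance boundaries.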