Overlapped speech and gender detection with WavLM pre-trained features

Lebourdais, Martin; Tahon, Marie; Laurent, Antoine; Meignier, Sylvain

doi:10.21437/interspeech.2022-10825

Cited by 7 publications (7 citation statements)

References 22 publications (30 reference statements)


“…The proxy models are expected to reach similar, eventually slightly lower, performances as the teacher. The score obtained on OSD by the teacher model compares with the state of the art which is 63.4% on DiHard III [14].…”

Section: Segmentation Performance (mentioning)

confidence: 80%

“…The teacher model is similar to [14] and is composed of two main parts: feature extraction and sequence modeling. The former is performed using pre-trained Wavlm Large [22] that outputs a sequence of 1024-dimension vectors.…”

Section: Model Architectures (mentioning)

confidence: 99%

“…Currently, the segmentation is mainly performed with neural networks and supervised learning. While each task has been generally solved independently as a binary frame-wise classification task (SAD [12], OSD [13,14], MD [15,16]), more recent approaches propose to solve multiple tasks simultaneously. The multiclass model predicts a single class and class intersection is empty.…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, in [17], authors propose to segment speech, music, and noise with a single multiclass model. A few works also report joint SAD and OSD [18][19][20]. In this paper, SAD, OSD, MD, and ND are simultaneously solved as a multilabel frame classification task.…”

Section: Introduction (mentioning)

confidence: 99%


An Explainable Proxy Model for Multilabel Audio Segmentation

Mariotte, Almudévar, Tahon et al. 2024

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)



“…Since overlapping speech is a rare event, the classes are unbalanced [57]. The class balance can be improved by artificially generating additional overlapped data by combining single-speaker utterances from other datasets [18,39] or random segments of the training data at training time [8].…”

Section: B Labelling Procedures (mentioning)

confidence: 99%

Microphone Array Channel Combination Algorithms for Overlapped Speech Detection

Mariotte, Larcher, Montrésor et al. 2022

Interspeech 2022


Overlapped speech occurs when multiple speakers are simultaneously active. This may lead to severe performance degradation in automatic speech processing systems such as speaker diarization. Overlapped speech detection (OSD) aims at detecting time segments in which several speakers are simultaneously active. Recent deep neural network architectures have shown impressive results in the close-talk scenario. However, performance tends to deteriorate in the context of distant speech. Microphone arrays are often considered under these conditions to record signals including spatial information. This paper investigates the use of the self-attention channel combinator (SACC) system as a feature extractor for OSD. This model is also extended in the complex space (cSACC) to improve the interpretability of the approach. Results show that distant OSD performance with self-attentive models gets closer to the nearfield condition. A detailed analysis of the cSACC combination weights is also conducted showing that the self-attention module focuses attention on the speakers' direction.


Improving Speaker Gender Detection by Combining Pitch and SDC

Mohanty, Cherukuri 2024

Lecture Notes in Networks and Systems

