This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method belongs to the family of end-to-end neural diarization (EEND), a promising approach that solves the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the models by exploiting the self-conditioning mechanism, which was originally applied to an automatic speech recognition model. With the self-conditioning mechanism, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To train the attractor-based EEND model efficiently, we propose an improved attractor computation module named the non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments on the CALLHOME two-speaker dataset show that the proposed self-conditioning boosts diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.

INDEX TERMS Encoder-decoder-based attractors, end-to-end neural diarization, intermediate objectives, non-autoregressive models, self-conditioning, speaker diarization.
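
For intuition, the sketch below illustrates how the self-conditioning loop described above could be wired into a Transformer-based EEND encoder: intermediate speaker-label posteriors are estimated after each encoder block, projected back into the embedding space, and added to the hidden states so that later blocks can condition on earlier label estimates. The class, layer sizes, and the projection modules (`to_labels`, `from_labels`) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SelfConditionedEENDSketch(nn.Module):
    """Minimal sketch of self-conditioning in a Transformer-based EEND encoder.

    Module and parameter names are hypothetical. Intermediate frame-wise
    speaker-label posteriors are fed back into the encoder, introducing
    label dependency via intermediate predictions.
    """

    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_speakers=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.to_labels = nn.Linear(d_model, n_speakers)    # frame-wise posteriors
        self.from_labels = nn.Linear(n_speakers, d_model)  # label feedback projection

    def forward(self, x):
        # x: (batch, frames, d_model) frame-level acoustic embeddings
        intermediate_logits = []
        for block in self.blocks[:-1]:
            x = block(x)
            logits = self.to_labels(x)          # intermediate speaker labels
            intermediate_logits.append(logits)
            # Self-conditioning: feed the whole sequence of intermediate
            # label posteriors back into the encoder before the next block.
            x = x + self.from_labels(logits.sigmoid())
        final_logits = self.to_labels(self.blocks[-1](x))
        # Training would attach (permutation-invariant) losses to both the
        # final and the intermediate predictions.
        return final_logits, intermediate_logits


# Usage example with dummy input: 8 recordings, 500 frames each
model = SelfConditionedEENDSketch()
final, intermediates = model(torch.randn(8, 500, 256))
print(final.shape, len(intermediates))  # torch.Size([8, 500, 2]) 3
```
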