Interspeech 2021
DOI: 10.21437/interspeech.2021-1909
End-to-End Neural Diarization: From Transformer to Conformer

Abstract: We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conformer gives an additional gain over the Transformer-based EEND. However, we notice that the Conformer-based EEND does n…

Cited by 26 publications (34 citation statements)
References 20 publications
“…End-to-end neural diarization (EEND) [7][8][9] is a method for estimating the speech activities of each speaker from a multiple speaker input mixture using an end-to-end neural network. Given a T -length sequence of F -dimensional acoustic features…”
Section: Background Methods (mentioning)
confidence: 99%
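The excerpt above characterizes EEND as mapping a T-length sequence of F-dimensional acoustic features to per-frame, per-speaker speech activities. A minimal shape-level sketch, assuming a T x F input and S speakers (the random projection is a hypothetical stand-in for the trained self-attention encoder, included only to make the shapes concrete):

```python
import math
import random

def eend_sketch(features, num_speakers, seed=0):
    """Toy stand-in for an EEND network: maps a T x F feature
    sequence to per-frame, per-speaker speech-activity probabilities.
    A real EEND model would use stacked self-attention (or Conformer)
    encoder layers here; this sketch uses a random linear projection
    purely to illustrate the input/output mapping T x F -> T x S."""
    rng = random.Random(seed)
    F = len(features[0])
    # Hypothetical projection weights standing in for the trained encoder.
    W = [[rng.gauss(0.0, 0.1) for _ in range(num_speakers)] for _ in range(F)]
    probs = []
    for frame in features:
        logits = [sum(frame[f] * W[f][s] for f in range(F))
                  for s in range(num_speakers)]
        # Per-speaker sigmoid: activities are estimated independently,
        # so overlapped speech (several speakers active in one frame)
        # is representable, unlike in clustering-based diarization.
        probs.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
    return probs  # shape: T x S

# T=5 frames of F=3 features, S=2 speakers
out = eend_sketch([[0.1, 0.2, 0.3]] * 5, num_speakers=2)
```

The per-speaker sigmoid output (rather than a single softmax over speakers) is what lets EEND handle speaker overlap directly.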
“…Hence, they have a problem with realistic data with speaker overlaps. Alternatively, fully end-to-end neural diarization (EEND) [7][8][9] systems can handle speaker overlap by training with the speaker overlap data. One drawback of EEND is the number of speakers has to be known and fixed beforehand.…”
Section: Introduction (mentioning)
confidence: 99%
“…Our baseline model is a Conformer-based EEND model described in [13]. The acoustic input to the model XAcoustic is subsampled by a factor of 4 with a 2-D convolutional layer.…”
Section: Conformer-based Diarization (mentioning)
confidence: 99%
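The excerpt above states only an overall subsampling factor of 4 from a 2-D convolutional layer. A small sketch of how the time-axis length shrinks under stacked strided convolutions (the kernel size, stride, and layer count are assumptions, e.g. two stride-2 layers; the excerpt specifies just the overall factor):

```python
def conv_subsample_length(T, kernel=3, stride=2, layers=2):
    """Output length along the time axis after `layers` stacked strided
    convolutions with no padding. Two stride-2 layers give roughly T/4,
    matching the overall subsampling factor of 4 stated in the excerpt."""
    for _ in range(layers):
        T = (T - kernel) // stride + 1
    return T

# A 1000-frame input shrinks to roughly 1000 / 4 frames.
reduced = conv_subsample_length(1000)
```

Subsampling before the encoder shortens the sequence the self-attention layers must process, which cuts the quadratic attention cost.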
“…Four Conformer layers of size 256 with 4 attention heads are used in the encoder. No positional encodings are used based on the finding in [13]. All models are trained with a batch size of 192 with the Adam optimizer, with the learning rate scheme described in [23] for 200 epochs.…”
Section: Data and Experiments (mentioning)
confidence: 99%
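The hyperparameters quoted above can be collected into a configuration sketch. The field names are illustrative; the values are only those stated in the excerpt, and the learning-rate scheme of [23] is left unspecified rather than guessed:

```python
# Hypothetical config layout; values are from the quoted excerpt.
conformer_eend_config = {
    "encoder_type": "conformer",
    "encoder_layers": 4,
    "hidden_size": 256,
    "attention_heads": 4,
    "positional_encoding": None,  # omitted, per the finding in [13]
    "batch_size": 192,
    "optimizer": "adam",
    "lr_schedule": "see [23]",    # scheme not detailed in the excerpt
    "epochs": 200,
}
```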
“…Alternatively, the end-to-end approach for speaker diarization is gaining attention due to its simple architecture and promising results compared to the conventional cascaded systems [17,18,19,20]. In this approach, diarization models are designed to estimate each speaker's speech activities from an input multi-speaker conversational recording.…”
Section: Introduction (mentioning)
confidence: 99%