ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414371

BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

Abstract: We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. We propose two variants: For unlimited-latency…
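As a rough, self-contained illustration of the mechanism summarized in the abstract, the PyTorch sketch below processes frames block by block: attention queries come only from the current block, the left context is supplied through hidden states cached from the previous block, and per-block cost is constant, so total cost grows linearly with time. Class and parameter names, layer counts, block size, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BlockRecurrentEncoder(nn.Module):
    """Block-wise encoder with left-context-only attention and per-layer
    hidden-state carry-over, in the spirit of Transformer-XL recurrence."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.ff = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_layers))

    def forward(self, block, memories):
        # block:    (batch, T, d_model) frames of the current block
        # memories: per-layer hidden states cached from the previous block (or None)
        x, new_memories = block, []
        for attn, ff, mem in zip(self.attn, self.ff, memories):
            new_memories.append(x.detach())       # cache this layer's input for the next block
            kv = x if mem is None else torch.cat([mem, x], dim=1)
            out, _ = attn(x, kv, kv)              # queries never attend beyond the current block
            x = x + out                           # residual (layer norm omitted for brevity)
            x = x + ff(x)
        return x, new_memories

# Streaming usage: feed blocks one at a time, carrying the memories forward.
encoder = BlockRecurrentEncoder()
memories = [None] * 4
for block in torch.randn(5, 1, 10, 256):          # 5 blocks of 10 frames, batch size 1
    frame_embeddings, memories = encoder(block, memories)
```

In the full system the cached context would be truncated to a bounded length and attractors would be computed per block, but the sketch is enough to show why the computation stays linear in the length of the recording.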

Cited by 24 publications (27 citation statements)
References 25 publications
“…In future work, we hope to explore similar techniques with a system like [24] that does not require prior knowledge of the number of speakers and allows inference with low latency. Note that state-of-the-art ASR systems like RNN-Transducer [25] cannot be used to extract phone and word-position information, as they output subword units and not phones.…”
Section: Discussion
confidence: 99%
“…This method uses speaker embeddings to convey information between blocks to make the order of output speakers consistent. BW-EDA-EEND [26] replaced the Transformer encoders in EEND with Transformer-XL [51] to extend EEND-EDA [16], [17] to deal with block-wise inputs. In this method, the hidden state embeddings obtained during processing the previous blocks are used to process the current block, thereby solving the speaker permutation ambiguity between blocks.…”
Section: Related Work
confidence: 99%
“…In particular, the clustering part requires a separate algorithm to be developed because most of the well-established clustering algorithms for speaker diarization [11], [14] cannot be used for online inference. However, end-to-end methods can relatively be easily utilized in online diarization by using a buffer to store the previous input-result pairs [24], [25] or by simply replacing the network architecture with the one that enables online inference [26].…”
Section: Introduction
confidence: 99%
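As a loose illustration of the buffer-based strategy mentioned in the quote above, the sketch below keeps a fixed-capacity buffer of recent (feature, result) pairs, re-runs an offline diarization model on the buffer plus the new chunk, and aligns the speaker ordering of the new output with the buffered results before emitting it. The callable eend_model (mapping features of shape (T, F) to posteriors of shape (T, S)), the buffer length, and the permutation-matching criterion are hypothetical assumptions for illustration, not the exact method of the cited papers.

```python
import itertools
import numpy as np

def online_diarization(chunks, eend_model, n_speakers=2, buffer_len=500):
    # Fixed-capacity buffer of the most recent input frames and their posteriors.
    buf_feats = np.empty((0, chunks[0].shape[1]))
    buf_post = np.empty((0, n_speakers))
    outputs = []
    for chunk in chunks:                                   # chunk: (T, F) acoustic features
        feats = np.concatenate([buf_feats, chunk], axis=0)
        post = eend_model(feats)                           # (n_buf + T, n_speakers) posteriors
        n_buf = len(buf_feats)
        if n_buf > 0:
            # Resolve the permutation ambiguity: pick the speaker ordering whose
            # posteriors on the buffered frames agree best with the stored results.
            best = max(itertools.permutations(range(n_speakers)),
                       key=lambda p: float(np.sum(buf_post * post[:n_buf][:, list(p)])))
            post = post[:, list(best)]
        outputs.append(post[n_buf:])                       # emit labels for the new chunk only
        buf_feats, buf_post = feats[-buffer_len:], post[-buffer_len:]
    return np.concatenate(outputs, axis=0)
```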
“…First, the 2000 NIST Speaker Recognition Evaluation [34] dataset, usually referred to as "Callhome" [35]. We report results on the subset of 2-speaker conversations using the standard Callhome partition. We will refer to the parts as CH1-2spk and CH2-2spk.…”
Section: Data
confidence: 99%
“…Most works following the EEND principle have focused on improvements on the architecture or modeling. Some by using self-attention layers [4] or conformer layers [5] instead of the original BLSTM layers for feature encoding; others have focused on more complex diarization scenarios such as its online fashion [6,7] or when more than one microphone is available [8] or by improving the model iteratively using pseudo-labels [9]. Some have used EEND together with more standard approaches by using EEND-inspired models to find overlaps among pairs of speakers in the output of a cascaded system [10] or leveraging EEND's VAD performance by using an external VAD system [11] or combining short duration diarization outputs to produce better whole-utterance diarization [12,13,14,15].…”
Section: Introduction
confidence: 99%