ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414371

BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

Abstract: We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. We propose two variants: For unlimited-latency…
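As a rough, self-contained illustration of the mechanism summarized in the abstract, the PyTorch sketch below processes frames block by block: attention queries come only from the current block, the left context is supplied through hidden states cached from the previous block, and per-block cost is constant, so total cost grows linearly with time. Class and parameter names, layer counts, block size, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BlockRecurrentEncoder(nn.Module):
    """Block-wise encoder with left-context-only attention and per-layer
    hidden-state carry-over, in the spirit of Transformer-XL recurrence."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.ff = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_layers))

    def forward(self, block, memories):
        # block:    (batch, T, d_model) frames of the current block
        # memories: per-layer hidden states cached from the previous block (or None)
        x, new_memories = block, []
        for attn, ff, mem in zip(self.attn, self.ff, memories):
            new_memories.append(x.detach())       # cache this layer's input for the next block
            kv = x if mem is None else torch.cat([mem, x], dim=1)
            out, _ = attn(x, kv, kv)              # queries never attend beyond the current block
            x = x + out                           # residual (layer norm omitted for brevity)
            x = x + ff(x)
        return x, new_memories

# Streaming usage: feed blocks one at a time, carrying the memories forward.
encoder = BlockRecurrentEncoder()
memories = [None] * 4
for block in torch.randn(5, 1, 10, 256):          # 5 blocks of 10 frames, batch size 1
    frame_embeddings, memories = encoder(block, memories)
```

In the full system the cached context would be truncated to a bounded length and attractors would be computed per block, but the sketch is enough to show why the computation stays linear in the length of the recording.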

Cited by 24 publications (27 citation statements)
References 25 publications
“…In future work, we hope to explore similar techniques with a system like [24] that does not require prior knowledge of the number of speakers and allows inference with low latency. Note that state-of-the-art ASR systems like RNN-Transducer [25] cannot be used to extract phone and word-position information, as they output subword units and not phones.…”
Section: Discussion
confidence: 99%
“…This method uses speaker embeddings to convey information between blocks to make the order of output speakers consistent. BW-EDA-EEND [26] replaced the Transformer encoders in EEND with Transformer-XL [51] to extend EEND-EDA [16], [17] to deal with block-wise inputs. In this method, the hidden state embeddings obtained during processing the previous blocks are used to process the current block, thereby solving the speaker permutation ambiguity between blocks.…”
Section: Related Work
confidence: 99%
“…In particular, the clustering part requires a separate algorithm to be developed because most of the well-established clustering algorithms for speaker diarization [11], [14] cannot be used for online inference. However, end-to-end methods can relatively be easily utilized in online diarization by using a buffer to store the previous input-result pairs [24], [25] or by simply replacing the network architecture with the one that enables online inference [26].…”
Section: Introduction
confidence: 99%
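As a loose illustration of the buffer-based strategy mentioned in the quote above, the sketch below keeps a fixed-capacity buffer of recent (feature, result) pairs, re-runs an offline diarization model on the buffer plus the new chunk, and aligns the speaker ordering of the new output with the buffered results before emitting it. The callable eend_model (mapping features of shape (T, F) to posteriors of shape (T, S)), the buffer length, and the permutation-matching criterion are hypothetical assumptions for illustration, not the exact method of the cited papers.

```python
import itertools
import numpy as np

def online_diarization(chunks, eend_model, n_speakers=2, buffer_len=500):
    # Fixed-capacity buffer of the most recent input frames and their posteriors.
    buf_feats = np.empty((0, chunks[0].shape[1]))
    buf_post = np.empty((0, n_speakers))
    outputs = []
    for chunk in chunks:                                   # chunk: (T, F) acoustic features
        feats = np.concatenate([buf_feats, chunk], axis=0)
        post = eend_model(feats)                           # (n_buf + T, n_speakers) posteriors
        n_buf = len(buf_feats)
        if n_buf > 0:
            # Resolve the permutation ambiguity: pick the speaker ordering whose
            # posteriors on the buffered frames agree best with the stored results.
            best = max(itertools.permutations(range(n_speakers)),
                       key=lambda p: float(np.sum(buf_post * post[:n_buf][:, list(p)])))
            post = post[:, list(best)]
        outputs.append(post[n_buf:])                       # emit labels for the new chunk only
        buf_feats, buf_post = feats[-buffer_len:], post[-buffer_len:]
    return np.concatenate(outputs, axis=0)
```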
“…First, the 2000 NIST Speaker Recognition Evaluation [34] dataset, usually referred to as "Callhome" [35]. We report results on the subset of 2-speaker conversations using the standard Callhome partition. We will refer to the parts as CH1-2spk and CH2-2spk.…”
Section: Data
confidence: 99%
“…Most works following the EEND principle have focused on improvements on the architecture or modeling. Some by using self-attention layers [4] or conformer layers [5] instead of the original BLSTM layers for feature encoding; others have focused on more complex diarization scenarios such as its online fashion [6,7] or when more than one microphone is available [8] or by improving the model iteratively using pseudo-labels [9]. Some have used EEND together with more standard approaches by using EEND-inspired models to find overlaps among pairs of speakers in the output of a cascaded system [10] or leveraging EEND's VAD performance by using an external VAD system [11] or combining short duration diarization outputs to produce better whole-utterance diarization [12,13,14,15].…”
Section: Introduction
confidence: 99%