2023
DOI: 10.1109/taslp.2022.3233237

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Abstract: A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended to a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing …

Cited by 8 publications (6 citation statements)
References 68 publications (147 reference statements)
“…To make the number of output speakers flexible and unlimited, EEND-vector clustering (EEND-VC) [12]-[14] integrates the end-to-end and clustering approaches: it applies an EEND model to short audio blocks and then matches inter-block speaker labels by clustering speaker embeddings. In addition, EEND-GLA [15], [16] calculates local attractors from each short block and finds the speaker correspondence based on similarities between inter-block attractors. As its training only requires relative speaker labels within a recording, EEND-GLA is practical for adapting models to in-the-wild datasets without globally unique speaker labels.…”
Section: B. End-to-End Neural Diarization
confidence: 99%
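
The inter-block correspondence step described in this excerpt can be sketched in a few lines. The snippet below is a minimal illustration, not the exact algorithm of [15], [16]: local attractors from two adjacent blocks are matched by cosine similarity with a Hungarian assignment (function names, dimensions, and the random inputs are placeholders).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_local_attractors(attr_prev, attr_curr):
        """attr_prev: (S_prev, D) and attr_curr: (S_curr, D) local attractors.
        Returns (prev_idx, curr_idx) pairs assigning current speakers to previous ones."""
        a = attr_prev / np.linalg.norm(attr_prev, axis=1, keepdims=True)
        b = attr_curr / np.linalg.norm(attr_curr, axis=1, keepdims=True)
        sim = a @ b.T                           # (S_prev, S_curr) cosine similarities
        row, col = linear_sum_assignment(-sim)  # maximize total similarity
        return list(zip(row.tolist(), col.tolist()))

    # Example with two 3-speaker blocks and 16-dimensional attractors.
    rng = np.random.default_rng(0)
    print(match_local_attractors(rng.normal(size=(3, 16)), rng.normal(size=(3, 16))))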
“…Nevertheless, permutation-invariant training in EEND-based methods causes performance degradation when the number of speakers increases in long recordings. Although a few studies [12]-[16] have explored unsupervised clustering to address this problem, their results are still unsatisfactory. Recently, Target-Speaker Voice Activity Detection (TS-VAD) approaches [17]-[20] have become attractive; they combine the advantages of modularized methods and end-to-end neural networks.…”
Section: Introduction
confidence: 99%
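
For context on the permutation issue raised in this excerpt, a brute-force permutation-invariant training loss can be written as below. This is a generic illustration (numpy, frame-averaged binary cross-entropy), not the objective of any specific cited system, and it only scales to a handful of speakers.

    import itertools
    import numpy as np

    def pit_bce(pred, label, eps=1e-7):
        """pred, label: (T, S) per-frame speech activities in [0, 1].
        Returns the binary cross-entropy under the best speaker permutation."""
        n_spk = label.shape[1]
        best = np.inf
        for perm in itertools.permutations(range(n_spk)):
            p = np.clip(pred[:, list(perm)], eps, 1 - eps)
            bce = -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()
            best = min(best, bce)
        return best

    # A column-swapped prediction still scores near zero, showing the order-invariance.
    label = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
    print(pit_bce(label[:, ::-1], label))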
“…Another approach to tackle the issue of speaker permutation ambiguity involves maintaining acoustic features within a speaker-tracing buffer [26,30]. On top of this approach, Horiguchi et al. [31] enhanced EEND-GLA for online usage by integrating a speaker-tracing buffer. Remarkably, this method achieved SOTA performance across various datasets, outperforming numerous offline modularized speaker diarization systems.…”
Section: B
confidence: 99%
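
The speaker-tracing-buffer idea mentioned here can be pictured with the rough bookkeeping below. The class name, the fixed speaker dimension, and the uniform subsampling rule are simplifying assumptions; the selection strategies in [26,30] and in Horiguchi et al. [31] are more elaborate.

    import numpy as np

    class SpeakerTracingBuffer:
        """Keeps a bounded set of past frames and their diarization results so that
        each new block can be processed together with frames of already-seen speakers."""

        def __init__(self, max_frames=500):
            self.max_frames = max_frames
            self.feats = None    # (n, D) buffered acoustic features
            self.labels = None   # (n, S) buffered diarization results (fixed S assumed)

        def concat_with(self, block_feats):
            # Features fed to the diarization model: buffer first, then the new block.
            return block_feats if self.feats is None else np.vstack([self.feats, block_feats])

        def update(self, feats, labels):
            # Append the newest frames; subsample uniformly if the buffer overflows.
            if self.feats is None:
                self.feats, self.labels = feats, labels
            else:
                self.feats = np.vstack([self.feats, feats])
                self.labels = np.vstack([self.labels, labels])
            if len(self.feats) > self.max_frames:
                keep = np.linspace(0, len(self.feats) - 1, self.max_frames).astype(int)
                self.feats, self.labels = self.feats[keep], self.labels[keep]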
“…During the training stage, this inconsistency can be solved using a permutation-free objective [20,29], but this makes it hard to extend EEND for online use, since the order of speakers keeps changing as new signal arrives. One solution is to employ a buffer that stores both prior inputs and results so that the current results can be aligned with previous ones [26,30,31], but this usually requires a large buffer for satisfactory performance.…”
Section: Introduction
confidence: 99%
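
The alignment that such a buffer enables can be illustrated as a search over speaker-column permutations on the overlapping (buffered) frames. This toy version assumes the model is re-run on the buffer plus the new block; the shapes and names are invented for the example and do not reproduce the exact procedure of [26,30,31].

    import itertools
    import numpy as np

    def align_to_buffer(buffered_labels, new_output):
        """buffered_labels: (n, S) stored results; new_output: (n + t, S) fresh output on
        the buffered frames followed by the new block. Returns new_output with its
        speaker columns reordered to agree with the buffer."""
        n, n_spk = buffered_labels.shape
        overlap = new_output[:n]
        best_perm, best_err = list(range(n_spk)), np.inf
        for perm in itertools.permutations(range(n_spk)):
            err = np.abs(overlap[:, list(perm)] - buffered_labels).mean()
            if err < best_err:
                best_perm, best_err = list(perm), err
        return new_output[:, best_perm]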
“…Additionally, Horiguchi et al. [21] introduced the EEND-EDA system, which utilizes an LSTM encoder-decoder network to model attractors for each speaker. Furthermore, researchers have also proposed two-stage hybrid systems [22], [23] to address the challenge of handling a flexible number of speakers. These systems first output diarization results for short segments with a limited number of speakers using EEND, and then employ a clustering algorithm to solve the inter-segment speaker permutation problem.…”
Section: Introduction
confidence: 99%
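
The encoder-decoder attractor (EDA) mechanism referred to in this excerpt can be schematized as follows. Layer sizes, the zero-vector decoder input, and the 0.5 stopping threshold are illustrative assumptions rather than the exact configuration of the system in [21].

    import torch
    import torch.nn as nn

    class AttractorDecoder(nn.Module):
        def __init__(self, dim=256, max_speakers=10):
            super().__init__()
            self.encoder = nn.LSTM(dim, dim, batch_first=True)   # summarizes frame embeddings
            self.decoder = nn.LSTM(dim, dim, batch_first=True)   # emits one attractor per step
            self.exist = nn.Linear(dim, 1)                       # attractor existence logit
            self.max_speakers = max_speakers

        def forward(self, emb, threshold=0.5):
            """emb: (1, T, dim) frame embeddings -> (n_spk, dim) attractors."""
            _, state = self.encoder(emb)
            zeros = emb.new_zeros(1, 1, emb.size(-1))
            attractors = []
            for _ in range(self.max_speakers):
                out, state = self.decoder(zeros, state)
                if torch.sigmoid(self.exist(out)).item() < threshold:
                    break                                        # no further speakers
                attractors.append(out.squeeze(0).squeeze(0))
            return torch.stack(attractors) if attractors else emb.new_zeros(0, emb.size(-1))

    # Example: 200 frames of 256-dimensional embeddings (untrained weights, so the
    # number of decoded attractors is arbitrary here).
    print(AttractorDecoder()(torch.randn(1, 200, 256)).shape)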