Online Target Speaker Voice Activity Detection for Speaker Diarization

Wang, Weiqing; Li, Ming; Lin, Quan

doi:10.21437/interspeech.2022-677

Cited by 9 publications

(6 citation statements)

References 72 publications

(123 reference statements)


“…Rather than relying on enrollment utterances, it estimates speaker profiles from estimated single-speaker regions of the recording to be diarized. It was later shown that the exact knowledge of the number of speakers is unnecessary, as long as a maximum number of speakers potentially present can be given [20], and the attention approach of [21] could do away even with this requirement.…”

Section: STFT (mentioning)

confidence: 99%

“…From this description, it is obvious that TS-VAD assumes knowledge of the total number K of speakers in the meeting to be diarized, because K defines the dimensionality of the network output. This constraint can be relaxed by incorporating an attention mechanism as was shown in [21], for the case of fully overlapped speech separation. Nevertheless, we keep the original TS-VAD stacking, only increasing the number of speakers from 4 in [18] to 8.…”

Section: B. TS-VAD Architecture (mentioning)

confidence: 99%


An Initialization Scheme for Meeting Separation with Spatial Mixture Models

Boeddeker, Cord-Landwehr, Neumann, et al. 2022

Interspeech 2022


Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either via masking or via beamforming. The technique can be applied both for single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.
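The masking variant mentioned in this abstract can be illustrated with a minimal sketch. It assumes the mixture is already available as an STFT (a list of frames, each a list of complex bins) and that each speaker's time-frequency activity estimate is a real-valued mask of the same shape. The function name `extract_by_masking` and the data layout are hypothetical, and the beamforming variant is not shown.

```python
from typing import List

Spectrogram = List[List[complex]]  # frames x frequency bins (assumed layout)
Mask = List[List[float]]           # same shape, values in [0, 1]


def extract_by_masking(mixture: Spectrogram, masks: List[Mask]) -> List[Spectrogram]:
    """Apply one time-frequency mask per speaker to the mixture STFT.

    Each output spectrogram is the element-wise product of the mixture
    with that speaker's mask. Illustrative only; not the authors' code.
    """
    extracted = []
    for mask in masks:
        speaker_stft = [
            [m * x for m, x in zip(mask_frame, mix_frame)]
            for mask_frame, mix_frame in zip(mask, mixture)
        ]
        extracted.append(speaker_stft)
    return extracted
```

With complementary masks for two speakers, the two extracted spectrograms sum back to the mixture, which is a quick sanity check on the data layout.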


“…This paper builds upon our earlier research [32] focused on online speaker diarization. The new contributions of this extension include:…”

Section: Introduction (mentioning)

confidence: 99%

Robust End-to-end Speaker Diarization with Generic Neural Clustering

Yang, Wang 2022

Interspeech 2022


This paper proposes an online target speaker voice activity detection system for speaker diarization tasks, which does not require a priori knowledge from the clustering-based diarization system to obtain the target speaker embeddings. By adapting the conventional target speaker voice activity detection for real-time operation, this framework can identify speaker activities using self-generated embeddings, resulting in consistent performance without permutation inconsistencies in the inference phase. During the inference process, we employ a front-end model to extract the frame-level speaker embeddings for each incoming block of a signal. Next, we predict the detection state of each speaker based on these frame-level speaker embeddings and the previously estimated target speaker embedding. Then, the target speaker embeddings are updated by aggregating these frame-level speaker embeddings according to the predictions in the current block. Our model predicts the results for each block and updates the target speakers' embeddings until reaching the end of the signal. Experimental results show that the proposed method outperforms the offline clustering-based diarization system on the DIHARD III and AliMeeting datasets. The proposed method is further extended to multi-channel data, which achieves performance similar to that of the state-of-the-art offline diarization systems.
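The block-wise inference loop described in this abstract (predict from the previous target embeddings, then update them from the current block) can be sketched as follows. This is a rough illustration under strong assumptions: the neural front-end and the TS-VAD back-end are replaced by precomputed frame-level embeddings and a cosine-similarity threshold, initial target embeddings are simply given, and aggregation is a running mean. All names (`detect`, `update_targets`, `diarize_online`, `threshold`) are hypothetical and not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def detect(frames: List[Vector], targets: List[Vector], threshold: float) -> List[List[bool]]:
    """Stand-in for the detection network: per-frame, per-speaker activity.

    ASSUMPTION: a cosine threshold replaces the learned back-end.
    """
    return [[cosine(f, t) >= threshold for t in targets] for f in frames]


def update_targets(frames: List[Vector], activity: List[List[bool]],
                   targets: List[Vector], counts: List[int]) -> None:
    """Fold frames predicted active for speaker k into target k (running mean)."""
    for frame, row in zip(frames, activity):
        for k, active in enumerate(row):
            if active:
                counts[k] += 1
                targets[k] = [t + (x - t) / counts[k] for t, x in zip(targets[k], frame)]


def diarize_online(blocks: List[List[Vector]], targets: List[Vector],
                   counts: List[int], threshold: float = 0.7) -> List[List[List[bool]]]:
    """Process blocks in arrival order: predict first, then update the targets."""
    results = []
    for frames in blocks:
        activity = detect(frames, targets, threshold)  # uses previous targets
        update_targets(frames, activity, targets, counts)
        results.append(activity)
    return results
```

Because each block is scored against embeddings estimated only from earlier blocks, the speaker order stays fixed across blocks, which is the property the abstract refers to as avoiding permutation inconsistencies.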


“…Speaker diarisation (SD), which segments input audio to short utterances according to speaker identity, is going through a rapid breakthrough [1,2]. Based on the success of recent SD systems [3][4][5][6][7][8][9][10][11][12], online SD systems are also being developed [13][14][15][16][17][18][19][20]. In an online SD system, the system should decide the speaker label of a given short segment leveraging only current and past segments, where only a part of past segments are available.…”

Section: Introduction (mentioning)

confidence: 99%

“…Authors of [16] adopted a memory module for each speaker and contained selected embeddings, where VBx [9] and cosine operations on centroids were used for clustering. Wang et al [17] adapted target speaker voice activity detection (TS-VAD), a successful offline SD framework, to online SD scenarios [8,22]. As mentioned above, the literature is witnessing diverse frameworks.…”

Section: Introduction (mentioning)

confidence: 99%

Absolute decision corrupts absolutely: conservative online speaker diarisation

Kwon, Heo, Lee, et al. 2022

Preprint


Our focus lies in developing an online speaker diarisation framework which demonstrates robust performance across diverse domains. In online speaker diarisation, outputs generated in real-time are irreversible, and a few misjudgements in the early phase of an input session can lead to catastrophic results. We hypothesise that cautiously increasing the number of estimated speakers is of paramount importance among many other factors. Thus, our proposed framework includes decreasing the number of speakers by one when the system judges that an increase in the past was faulty. We also adopt dual buffers, checkpoints and centroids, where checkpoints are combined with silhouette coefficients to estimate the number of speakers and centroids represent speakers. Again, we believe that more than one centroid can be generated from one speaker. Thus we design a clustering-based label matching technique to assign labels in real time. The resulting system is lightweight yet surprisingly effective. The system demonstrates state-of-the-art performance on DIHARD II and III datasets, where it is also competitive in AMI and VoxConverse test sets.
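For readers unfamiliar with the silhouette coefficient mentioned in this abstract, the sketch below computes the standard mean silhouette score for a set of labelled points, using Euclidean distance. It only illustrates the general measure; how the paper combines it with checkpoints and dual buffers is not shown, and the function name `mean_silhouette` is hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def euclid(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points: List[Vector], labels: List[int]) -> float:
    """Mean silhouette coefficient; higher means tighter, better-separated clusters.

    For point i: a = mean distance to its own cluster (excluding itself),
    b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
    Points in singleton clusters contribute 0, following the usual convention.
    """
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0
        a = sum(euclid(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclid(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)
```

One common way to use such a score is to compare candidate clusterings with different numbers of clusters and prefer the one with the higher mean silhouette.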
