Interspeech 2021
DOI: 10.21437/interspeech.2021-750
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speakers

Cited by 26 publications (16 citation statements)
References 0 publications
“…In the future work, we aim at training a magnitude-aware embedding extractor from scratch to get rid of the ad hoc duration variability compensation transform. Another direction includes integrating the magnitude-based quality assessment into the two-step pipelines based on the target-speaker VAD such as [42].…”
Section: Discussion
confidence: 99%
“…Figure 1 illustrates our overall speaker diarization system for the 2022 M2MeT challenge. The core technology is that we used TS-VAD with an unknown number of multiple speakers [8] and tried some new strategies for the multi-channel Mandarin meeting scenario with heavy reverb and noise. In the training stage, the training data for TS-VAD will be introduced in Section 3.…”
Section: System Description
confidence: 99%
“…log Mel filter-banks (FBANKs)) as input, along with i-vectors corresponding to each speaker, and predicts per-frame speech activities for a fixed number of speakers simultaneously, which directly handles overlapping problems. In the flexible number of speakers case [8], the number of output nodes N is chosen as the maximum number of speakers in any recording in the training set, which is 4 for the ALIMEETING whose speaker number of each recording ranges from 2 to 4. First, the number of speakers N in each recording is estimated according to the oracle label when training and a CSD system when decoding.…”
Section: TS-VAD With An Unknown Number Of Speakers
confidence: 99%
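The input scheme quoted above — acoustic frames paired with one i-vector per target speaker, padded so the network always has N output nodes — can be sketched in plain Python. This is an illustrative sketch only; the names (`prepare_tsvad_inputs`, `MAX_SPEAKERS`) and the zero-padding convention for absent speakers are assumptions, not taken from the paper:

```python
# Illustrative sketch of TS-VAD input preparation for a flexible number
# of speakers. Assumption: recordings have 2-4 speakers, padded to N=4
# (the ALIMEETING maximum mentioned in the citation above).

MAX_SPEAKERS = 4  # number of output nodes N

def prepare_tsvad_inputs(frame_features, ivectors):
    """Pair every acoustic frame with each target speaker's i-vector.

    frame_features: list of per-frame feature vectors (e.g. log Mel FBANKs)
    ivectors: one i-vector per detected speaker (2-4 entries); absent
              speaker slots are filled with zero i-vectors so the network
              always sees MAX_SPEAKERS per-speaker inputs.
    """
    ivec_dim = len(ivectors[0])
    padded = list(ivectors) + [[0.0] * ivec_dim] * (MAX_SPEAKERS - len(ivectors))
    # One (frame features + speaker i-vector) concatenation per output
    # node and per frame.
    return [[frame + spk for spk in padded] for frame in frame_features]

# Toy example: 2 frames of 3-dim features, 2 speakers with 2-dim i-vectors.
feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
ivecs = [[1.0, 1.0], [2.0, 2.0]]
inputs = prepare_tsvad_inputs(feats, ivecs)
```

Each frame then yields MAX_SPEAKERS concatenated vectors, one per output node, regardless of how many speakers the recording actually contains.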
“…In contrast, most end-to-end methods fix the number of output speakers due to their network architecture [15], [27]. Most methods that enable the inference of a flexible number of speakers conduct it by outputting null speech activities for absent speakers, so the maximum number of speakers is limited [18], [28]. Some methods use speaker-wise auto-regressive inference to avoid setting the maximum number of speakers by the network architecture; but in practice, the number of output speakers is still capped by the training dataset [16], [17], [29], [30].…”
Section: Introduction
confidence: 99%
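The "null speech activities for absent speakers" strategy described in this citation can be sketched as a simple post-processing step: the network always emits a fixed number of activity streams, and the speaker count is recovered by discarding streams that are essentially silent. The function name and thresholds below are hypothetical choices for illustration:

```python
# Illustrative sketch: inferring the active-speaker count from a
# fixed-size network output by discarding "null" (near-silent) streams.
# Assumption: threshold=0.5 and min_frames=1 are hypothetical values.

def count_active_speakers(activities, threshold=0.5, min_frames=1):
    """activities: per-speaker lists of frame-wise speech probabilities.

    A stream counts as an active speaker if at least `min_frames` of its
    frames exceed `threshold`; otherwise it is treated as a null slot.
    """
    return sum(
        1 for stream in activities
        if sum(p > threshold for p in stream) >= min_frames
    )

# Toy output of a 4-node network on a 2-speaker recording.
probs = [
    [0.9, 0.8, 0.1],  # speaker 1: speaks
    [0.2, 0.7, 0.9],  # speaker 2: speaks
    [0.0, 0.1, 0.0],  # null slot (absent speaker)
    [0.0, 0.0, 0.1],  # null slot (absent speaker)
]
n = count_active_speakers(probs)
```

This illustrates why, as the citation notes, the maximum number of speakers such methods can handle is bounded by the number of output streams the architecture provides.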