Interspeech 2021
DOI: 10.21437/interspeech.2021-750
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speakers

Cited by 26 publications (16 citation statements)
References 0 publications
“…In the future work, we aim at training a magnitude-aware embedding extractor from scratch to get rid of the ad hoc duration variability compensation transform. Another direction includes integrating the magnitude-based quality assessment into the two-step pipelines based on the target-speaker VAD such as [42].…”
Section: Discussion
confidence: 99%
“…Figure 1 illustrates our overall speaker diarization system for the 2022 M2MeT challenge. The core technology is that we used TS-VAD with an unknown number of multiple speakers [8] and tried some new strategies for the multi-channel Mandarin meeting scenario with heavy reverb and noise. In the training stage, the training data for TS-VAD will be introduced in Section 3.…”
Section: System Description
confidence: 99%
“…log Mel filter-banks (FBANKs)) as input, along with i-vectors corresponding to each speaker, and predicts per-frame speech activities for a fixed number of speakers simultaneously, which directly handles overlapping problems. In the flexible number of speakers case [8], the number of output nodes N is chosen as the maximum number of speakers in any recording in the training set, which is 4 for the ALIMEETING whose speaker number of each recording ranges from 2 to 4. First, the number of speakers N in each recording is estimated according to the oracle label when training and a CSD system when decoding.…”
Section: TS-VAD With An Unknown Number Of Speakers
confidence: 99%
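The input scheme quoted above — acoustic frames paired with one i-vector per target speaker, padded so the network always has N output nodes — can be sketched in plain Python. This is an illustrative sketch only; the names (`prepare_tsvad_inputs`, `MAX_SPEAKERS`) and the zero-padding convention for absent speakers are assumptions, not taken from the paper:

```python
# Illustrative sketch of TS-VAD input preparation for a flexible number
# of speakers. Assumption: recordings have 2-4 speakers, padded to N=4
# (the ALIMEETING maximum mentioned in the citation above).

MAX_SPEAKERS = 4  # number of output nodes N

def prepare_tsvad_inputs(frame_features, ivectors):
    """Pair every acoustic frame with each target speaker's i-vector.

    frame_features: list of per-frame feature vectors (e.g. log Mel FBANKs)
    ivectors: one i-vector per detected speaker (2-4 entries); absent
              speaker slots are filled with zero i-vectors so the network
              always sees MAX_SPEAKERS per-speaker inputs.
    """
    ivec_dim = len(ivectors[0])
    padded = list(ivectors) + [[0.0] * ivec_dim] * (MAX_SPEAKERS - len(ivectors))
    # One (frame features + speaker i-vector) concatenation per output
    # node and per frame.
    return [[frame + spk for spk in padded] for frame in frame_features]

# Toy example: 2 frames of 3-dim features, 2 speakers with 2-dim i-vectors.
feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
ivecs = [[1.0, 1.0], [2.0, 2.0]]
inputs = prepare_tsvad_inputs(feats, ivecs)
```

Each frame then yields MAX_SPEAKERS concatenated vectors, one per output node, regardless of how many speakers the recording actually contains.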
“…In contrast, most end-to-end methods fix the number of output speakers due to their network architecture [15], [27]. Most methods that enable the inference of a flexible number of speakers conduct it by outputting null speech activities for absent speakers, so the maximum number of speakers is limited [18], [28]. Some methods use speaker-wise auto-regressive inference to avoid setting the maximum number of speakers by the network architecture; but in practice, the number of output speakers is still capped by the training dataset [16], [17], [29], [30].…”
Section: Introduction
confidence: 99%
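The "null speech activities for absent speakers" strategy described in this citation can be sketched as a simple post-processing step: the network always emits a fixed number of activity streams, and the speaker count is recovered by discarding streams that are essentially silent. The function name and thresholds below are hypothetical choices for illustration:

```python
# Illustrative sketch: inferring the active-speaker count from a
# fixed-size network output by discarding "null" (near-silent) streams.
# Assumption: threshold=0.5 and min_frames=1 are hypothetical values.

def count_active_speakers(activities, threshold=0.5, min_frames=1):
    """activities: per-speaker lists of frame-wise speech probabilities.

    A stream counts as an active speaker if at least `min_frames` of its
    frames exceed `threshold`; otherwise it is treated as a null slot.
    """
    return sum(
        1 for stream in activities
        if sum(p > threshold for p in stream) >= min_frames
    )

# Toy output of a 4-node network on a 2-speaker recording.
probs = [
    [0.9, 0.8, 0.1],  # speaker 1: speaks
    [0.2, 0.7, 0.9],  # speaker 2: speaks
    [0.0, 0.1, 0.0],  # null slot (absent speaker)
    [0.0, 0.0, 0.1],  # null slot (absent speaker)
]
n = count_active_speakers(probs)
```

This illustrates why, as the citation notes, the maximum number of speakers such methods can handle is bounded by the number of output streams the architecture provides.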