Interspeech 2022
DOI: 10.21437/interspeech.2022-677

Online Target Speaker Voice Activity Detection for Speaker Diarization

Abstract: Audio-visual learning has demonstrated promising results in many classical speech tasks (e.g., speech separation, automatic speech recognition, wake-word spotting). We believe that introducing visual modality will also benefit speaker diarization. To date, Target-Speaker Voice Activity Detection (TS-VAD) plays an important role in highly accurate speaker diarization. However, previous TS-VAD models take audio features and utilize the speaker's acoustic footprint to distinguish his or her personal speech activi…

Citations: cited by 9 publications (6 citation statements)
References: 72 publications (123 reference statements)
“…Rather than relying on enrollment utterances, it estimates speaker profiles from estimated single-speaker regions of the recording to be diarized. It was later shown that the exact knowledge of the number of speakers is unnecessary, as long as a maximum number of speakers potentially present can be given [20], and the attention approach of [21] could do away even with this requirement.…”
Section: Stft (citation type: mentioning)
confidence: 99%
“…From this description, it is obvious that TS-VAD assumes knowledge of the total number K of speakers in the meeting to be diarized, because K defines the dimensionality of the network output. This constraint can be relaxed by incorporating an attention mechanism as was shown in [21], for the case of fully overlapped speech separation. Nevertheless, we keep the original TS-VAD stacking, only increasing the number of speakers from 4 in [18] to 8.…”
Section: B. TS-VAD Architecture (citation type: mentioning)
confidence: 99%
“…This paper builds upon our earlier research [32] focused on online speaker diarization. The new contributions of this extension include:…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Speaker diarisation (SD), which segments input audio to short utterances according to speaker identity, is going through a rapid breakthrough [1,2]. Based on the success of recent SD systems [3][4][5][6][7][8][9][10][11][12], online SD systems are also being developed [13][14][15][16][17][18][19][20]. In an online SD system, the system should decide the speaker label of a given short segment leveraging only current and past segments, where only a part of past segments are available.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Authors of [16] adopted a memory module for each speaker and contained selected embeddings, where VBx [9] and cosine operations on centroids were used for clustering. Wang et al [17] adapted target speaker voice activity detection (TS-VAD), a successful offline SD framework, to online SD scenarios [8,22]. As mentioned above, the literature is witnessing diverse frameworks.…”
Section: Introduction (citation type: mentioning)
confidence: 99%