ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414578

Multi-Scale Speaker Diarization with Neural Affinity Score Fusion

Abstract: Predicting the speaker's identity of short speech segments in human dialogue has been considered one of the most challenging problems in speech signal processing. Speaker representations of short speech segments tend to be unreliable, resulting in poor fidelity of speaker representations in tasks requiring speaker recognition. In this paper, we propose an unconventional method that tackles the trade-off between temporal resolution and the quality of the speaker representations. To find a set of weights that ba…

Citations: cited by 7 publications (2 citation statements)
References: 17 publications (21 reference statements)
“…To address this, one line of research has combined embeddings extracted with different window sizes and shifts. In [Park et al, 2021], the affinity matrices of three different configurations are processed with an NN and the resulting matrix is used for spectral clustering. In [Kwon et al, 2022], embeddings from different resolutions are paired with vectors that denote which scale they were extracted with (analogously to positional encoding in Transformer models) and they are processed with an attention mechanism to obtain similarities between the embeddings (through the inherent attention weights).…”
Section: Neural Network For Speaker Embeddings and Affinity Matrices (mentioning)
confidence: 99%
“…These systems involve a dedicated model trained to detect the exact moment when speakers change. To deal with a trade-off between long and short segment lengths, a group of works employs multi-scale segmentation [6,7]. They use multiple scales (segment lengths) and fuse the similarity scores between embeddings obtained from the results of each scale.…”
Section: Introduction and Related Work (mentioning)
confidence: 99%
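
The citation statements above describe fusing similarity scores computed at several segment lengths before clustering. The following is a minimal sketch of that general idea, not the paper's implementation. It assumes the embeddings from every scale have already been mapped onto the same set of base segments, and it uses fixed, hand-picked weights in place of the weights the paper estimates with a neural network. The function names `cosine_affinity` and `fuse_affinities` are hypothetical.

```python
import numpy as np


def cosine_affinity(emb):
    """Pairwise cosine similarity for an (n_segments, dim) embedding matrix."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return unit @ unit.T


def fuse_affinities(embs_per_scale, weights):
    """Weighted sum of per-scale cosine affinity matrices.

    Assumption: every array in embs_per_scale has one row per base segment,
    so all affinity matrices share the same shape. The weights are fixed
    here; they are normalized to sum to 1 before fusing.
    """
    weights = np.asarray(weights, dtype=float)
    if len(embs_per_scale) != len(weights):
        raise ValueError("need exactly one weight per scale")
    weights = weights / weights.sum()
    affinities = [cosine_affinity(e) for e in embs_per_scale]
    return sum(w * a for w, a in zip(weights, affinities))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three scales, 6 base segments, 16-dimensional embeddings (synthetic data).
    scales = [rng.normal(size=(6, 16)) for _ in range(3)]
    fused = fuse_affinities(scales, weights=[0.5, 0.3, 0.2])
    print(fused.shape)
```

The fused matrix could then be passed to a clustering step such as spectral clustering, which is what the first citation statement reports for the original work.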