ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414094
Short-Time Spectral Aggregation for Speaker Embedding

Abstract: State-of-the-art speaker verification systems take frame-level acoustic features as input and produce fixed-dimensional embeddings as utterance-level representations. Thus, how to aggregate information from frame-level features is vital for achieving high performance. This paper introduces short-time spectral pooling (STSP) for better aggregation of frame-level information. STSP transforms the temporal feature maps of a speaker embedding network into the spectral domain and extracts the lowest spectral components…

Cited by 5 publications (9 citation statements) · References 18 publications
“…Later, more advanced networks, such as ResNets [9], DenseNets [10], [12], and Res2Nets [11], [13], were introduced to better model the spectral-temporal relationship across the acoustic frames. Simultaneously, diverse aggregation methods have been proposed to aggregate the frame-level information into utterance-level embeddings, e.g., statistics pooling [2], multihead attentive pooling [14], NetVLAD-based pooling [9], short-time spectral pooling [15], [16], etc. Also, different training losses besides the softmax loss have been used in deep speaker embedding to achieve better discriminative power.…”
Section: A. Text-Independent Speaker Verification (mentioning)
confidence: 99%
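The citation statement above lists statistics pooling as the baseline aggregation method that maps variable-length frame-level features to a fixed-dimensional utterance embedding. A minimal sketch of that idea (the function name and dimensions are illustrative, not from the paper):

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features of shape (T, D) into a fixed
    2*D-dimensional utterance-level vector by concatenating the
    per-dimension mean and standard deviation over the T frames."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    return np.concatenate([mu, sigma])

# Example: 200 frames of 64-dimensional features -> 128-dim embedding,
# regardless of how many frames the utterance has.
feats = np.random.randn(200, 64)
emb = statistics_pooling(feats)
print(emb.shape)  # (128,)
```

Because the output size depends only on the feature dimension D, utterances of different lengths all map to embeddings of the same size, which is what makes pooling layers the bridge between frame-level and utterance-level modeling.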
“…However, because DFT can only be applied to deterministic or wide-sense stationary signals, it is not suitable for non-stationary speech signals [22]. To account for the nonstationarity of the convolutional feature maps in speaker embedding networks, short-time spectral pooling (STSP) was proposed in [23] by replacing DFT with short-time Fourier transform (STFT) [24]. It was shown in [23] that STSP is a generalized statistics pooling method.…”
Section: A. Motivation (mentioning)
confidence: 99%
“…To account for the nonstationarity of the convolutional feature maps in speaker embedding networks, short-time spectral pooling (STSP) was proposed in [23] by replacing DFT with short-time Fourier transform (STFT) [24]. It was shown in [23] that STSP is a generalized statistics pooling method. This is because from a Fourier perspective, statistics pooling only exploits the DC (zero-frequency) components in the spectral domain, whereas STSP incorporates more spectral components besides the DC ones during aggregation and is able to retain richer speaker information.…”
Section: A. Motivation (mentioning)
confidence: 99%
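The statement above makes two claims that can be checked numerically: the mean used by statistics pooling is exactly the DC (zero-frequency) DFT component divided by the frame count, and an STSP-style aggregator retains additional low-frequency components computed over short windows. A toy sketch under those assumptions (window length, hop, and the number of retained bins `K` are illustrative choices, not the authors' exact configuration):

```python
import numpy as np

def stsp_lowest_components(feat_map, win=32, hop=16, K=2):
    """Toy short-time spectral pooling: frame the temporal axis of a
    (T, D) feature map into overlapping windows, take a DFT along time
    within each window, keep the magnitudes of the K lowest-frequency
    bins, and average them over windows -> a (K*D,) vector."""
    T, D = feat_map.shape
    segs = np.stack([feat_map[s:s + win]
                     for s in range(0, T - win + 1, hop)])  # (S, win, D)
    spec = np.fft.rfft(segs, axis=1)                        # (S, win//2+1, D)
    low = np.abs(spec[:, :K, :])                            # K lowest bins
    return low.mean(axis=0).reshape(-1)

x = np.random.randn(200, 8)

# Fourier view of statistics pooling: the k=0 (DC) bin of a full DFT,
# divided by T, equals the per-dimension mean exactly.
dc = np.fft.rfft(x, axis=0)[0].real / len(x)
print(np.allclose(dc, x.mean(axis=0)))  # True

print(stsp_lowest_components(x).shape)  # (16,)
```

With `K=1` and a single full-length window, the sketch reduces to the magnitude of the DC component, i.e., the mean; larger `K` keeps extra low-frequency bins, which is the sense in which the excerpt calls STSP a generalization of statistics pooling.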