Short-Time Spectral Aggregation for Speaker Embedding

Tu, Youzhi; Mak, Man‐Wai

doi:10.1109/icassp39728.2021.9414094

Cited by 5 publications

(9 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Later, more advanced networks, such as ResNets [9], DenseNets [10], [12], and Res2Nets [11], [13], were introduced to better model the spectral-temporal relationship across the acoustic frames. Simultaneously, diverse aggregation methods have been proposed to aggregate the frame-level information into utterance-level embeddings, e.g., statistics pooling [2], multihead attentive pooling [14], NetVLAD-based pooling [9], short-time spectral pooling [15], [16], etc. Also, different training losses besides the softmax loss have been used in deep speaker embedding to achieve better discriminative power.…”

Section: A Text-independent Speaker Verificationmentioning

confidence: 99%

A Survey on Text-Dependent and Text-Independent Speaker Verification

Lin

Mak

2022

IEEE Access

Self Cite

View full text Add to dashboard Cite

Speaker verification (SV) aims to detect an individual's identity from his/her voice. SV has been successfully applied in various areas such as access control, remote service customization, financial transactions, etc. Depending on whether the text content is pre-defined or not, SV can be text-dependent or text-independent. This paper reviews recent research on text-dependent SV (TD-SV) and text-independent SV (TI-SV). Because most modern SV systems apply deep learning methods to boost performance, we focus on the studies that use deep speaker embedding, a technique representing a person's identity via a fixed-dimensional vector encoded from a variable-length utterance. Rather than detailing every existing SV system, we make an overview of the representative SV systems that have attracted wide attention. Furthermore, an increasing number of SV systems have been devoted to addressing real-world challenges such as reverberation and noise, and this has driven a large number of studies on practical SV. Therefore, the survey compares the existing SV systems in the Far-Field Speaker Verification Challenge 2020 (FFSVC 2020) to illustrate the most effective techniques for both TD-SV and TI-SV. INDEX TERMSText-dependent speaker verification, text-independent speaker verification, deep speaker embedding, far-field speaker verification

show abstract

Section: A Text-independent Speaker Verificationmentioning

confidence: 99%

A Survey on Text-Dependent and Text-Independent Speaker Verification

Lin

Mak

2022

IEEE Access

Self Cite

View full text Add to dashboard Cite

show abstract

“…However, because DFT can only be applied to deterministic or wide-sense stationary signals, it is not suitable for non-stationary speech signals [22]. To account for the nonstationarity of the convolutional feature maps in speaker embedding networks, short-time spectral pooling (STSP) was proposed in [23] by replacing DFT with short-time Fourier transform (STFT) [24]. It was shown in [23] that STSP is a generalized statistics pooling method.…”

Section: A Motivationmentioning

confidence: 99%

“…To account for the nonstationarity of the convolutional feature maps in speaker embedding networks, short-time spectral pooling (STSP) was proposed in [23] by replacing DFT with short-time Fourier transform (STFT) [24]. It was shown in [23] that STSP is a generalized statistics pooling method. This is because from a Fourier perspective, statistics pooling only exploits the DC (zero-frequency) components in the spectral domain, whereas STSP incorporates more spectral components besides the DC ones during aggregation and is able to retain richer speaker information.…”

Section: A Motivationmentioning

confidence: 99%

“…In attentive STSP, however, the pooling operation is performed completely in the spectral domain and inverse STFT is not required, which reduces computation and facilitates the aggregation. 2) Compared with vanilla STSP [23], attentive STSP applies a self-attention mechanism to highlight the contribution of the discriminative windowed segments in the pooling operation. The attention mechanism endows attentive STSP with higher discriminative power, and it is a new part not covered by [23].…”

Section: A Motivationmentioning

confidence: 99%

“…2) Compared with vanilla STSP [23], attentive STSP applies a self-attention mechanism to highlight the contribution of the discriminative windowed segments in the pooling operation. The attention mechanism endows attentive STSP with higher discriminative power, and it is a new part not covered by [23].…”

Section: A Motivationmentioning

confidence: 99%

See 2 more Smart Citations

Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding

Mak

2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

View full text Add to dashboard Cite

Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, due to the high non-stationarity in the feature maps produced from the last frame-level layer, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations through a weighted average of the windowed segments within each spectrogram by attention weights and aggregate their lowest spectral components to form the speaker embedding. Because most of the feature map energy is concentrated in the low-frequency region of the spectral domain, attentive STSP facilitates the information aggregation by retaining the low spectral components only. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings.

show abstract

Multi-Fisher and Triple-Domain Feature Enhancement-Based Short Utterance Speaker Verification for IoT Smart Service

Zi,

Xiong

2024

IEEE Internet Things J.

View full text Add to dashboard Cite

Short-Time Spectral Aggregation for Speaker Embedding

Cited by 5 publications

References 18 publications

A Survey on Text-Dependent and Text-Independent Speaker Verification

A Survey on Text-Dependent and Text-Independent Speaker Verification

Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding

Multi-Fisher and Triple-Domain Feature Enhancement-Based Short Utterance Speaker Verification for IoT Smart Service

Contact Info

Product

Resources

About