2022
DOI: 10.3390/s22062147
Attention-Based Temporal-Frequency Aggregation for Speaker Verification

Abstract: Convolutional neural networks (CNNs) have significantly promoted the development of speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component, and it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most of the existing aggregation methods aggregate the extracted features across time and cannot capture the speaker-dependent informat…

Cited by 5 publications (5 citation statements)
References 38 publications
“…Therefore, a flexible processing method should have the ability to accept audio of any duration and obtain the fixed-dimensional features. In speaker recognition systems, the GAP aggregation model [ 10 , 20 , 24 ] is the most commonly used method for aggregating the frame-level features into fixed-dimensional utterance-level features. The reference [ 7 ] employs statistical pooling (SP) to aggregate the features, which computes the mean vector of the frame-level features and the standard deviation vector of the second-order statistics; then, it stitches them together as an utterance-level feature.…”
Section: Related Work
confidence: 99%
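The statistical pooling (SP) described in the excerpt above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not code from the cited papers; the function name `statistical_pooling` is hypothetical:

```python
import numpy as np

def statistical_pooling(frames: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features of shape (T, D) into a single
    utterance-level vector by concatenating the per-dimension mean
    and standard deviation, yielding a fixed (2*D,) output for any T."""
    mu = frames.mean(axis=0)     # mean vector, shape (D,)
    sigma = frames.std(axis=0)   # standard-deviation vector, shape (D,)
    return np.concatenate([mu, sigma])
```

Because the statistics are taken over the time axis, utterances of any duration map to an output of the same fixed dimension, which is exactly the property the quoted passage highlights.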
“…It is calculated in the same way as the SP aggregation model, but has better performance than SP. They also introduced the NetVLAD aggregation model in computer vision that aggregates features into fixed dimensions via clustering [ 20 , 28 ]. NetVLAD assigns each frame-level feature to a different cluster center and encodes the residuals as output features.…”
Section: Related Work
confidence: 99%
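The NetVLAD aggregation mentioned above (soft-assigning frame-level features to cluster centers and encoding the residuals) can be sketched as follows. This is a simplified NumPy illustration under assumed shapes, not the implementation from references [20, 28]; `alpha` is an assumed assignment-sharpness parameter:

```python
import numpy as np

def netvlad(frames: np.ndarray, centers: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Simplified NetVLAD: soft-assign frame-level features (T, D) to
    K cluster centers (K, D), sum the weighted residuals per cluster,
    and return a fixed-dimensional (K*D,) descriptor."""
    # Soft-assignment weights: softmax over clusters of scaled dot products.
    logits = alpha * frames @ centers.T                   # (T, K)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)         # (T, K)
    # Weighted residuals: sum_t w_tk * (x_t - c_k) for each cluster k.
    residuals = frames[:, None, :] - centers[None, :, :]  # (T, K, D)
    vlad = (weights[:, :, None] * residuals).sum(axis=0)  # (K, D)
    # Intra-normalize each cluster, then L2-normalize the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    out = vlad.ravel()
    return out / (np.linalg.norm(out) + 1e-12)
```

As with statistical pooling, the output dimension (K*D) is independent of the number of input frames T, which is what makes the method usable as an utterance-level aggregator.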
“…Then we can conclude that the receptive field is 1 after convolution with a 1×1 kernel and a stride of 1. It is also possible to derive the per-layer receptive field of the Bottleneck Block, as shown in Equation (7), and the receptive field of the parallel branches in the Split-ResNet Block, as shown in Equation (8). Comparing Equation (7) with Equation (8), it can be concluded that Split-ResNet has more numerous and larger receptive fields, which gives it stronger feature-extraction ability than ResNet and allows it to produce better results than the original network.…”
Section: A Split-ResNet
confidence: 99%
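The receptive-field reasoning in the excerpt above follows the standard recursion for stacked convolutions. A small sketch (a generic illustration of the recursion, not the cited paper's exact Equations (7) and (8); the layer configurations are illustrative):

```python
def receptive_field(layers):
    """Compute the receptive field of a stack of convolution layers,
    each given as (kernel_size, stride), via the standard recursion
    r_l = r_{l-1} + (k_l - 1) * jump, where jump is the product of
    the strides of all preceding layers."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# A 1x1 convolution with stride 1 leaves the receptive field at 1:
receptive_field([(1, 1)])                  # -> 1
# An illustrative bottleneck-style 1x1 -> 3x3 -> 1x1 stack:
receptive_field([(1, 1), (3, 1), (1, 1)])  # -> 3
```

The first call confirms the excerpt's observation that a stride-1 1×1 convolution does not enlarge the receptive field; only kernels wider than 1 (or accumulated stride) grow it.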
“…But with the development of deep learning, deep neural networks (DNNs) have brought breakthroughs in speaker recognition. A DNN-based system can directly process the input audio: it extracts frame-level features of the input audio through the DNN and then aggregates those frame-level features into utterance-level features through an aggregation model, which has made deep learning dominant in speaker recognition [7], [8].…”
Section: Introduction
confidence: 99%