ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054440

Frequency and Temporal Convolutional Attention for Text-Independent Speaker Recognition

Abstract: The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogra…
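
The attention described in the abstract gates a CNN feature map separately along the frequency and time axes. Below is a minimal PyTorch sketch in the spirit of CBAM's spatial-attention branch; it is not the authors' implementation, and the module name, kernel size, and (batch, channel, frequency, time) layout are assumptions for illustration.

```python
# Illustrative sketch only: axis-wise (frequency or temporal) attention
# in the spirit of CBAM's spatial attention branch. Not the paper's code.
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Pool over all dims except `axis`, then gate the input with a sigmoid mask."""
    def __init__(self, axis: int, kernel_size: int = 7):
        super().__init__()
        self.axis = axis  # 2 -> frequency axis, 3 -> time axis (assumed layout)
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Average- and max-pool over every dimension except batch and `axis`,
        # mirroring CBAM's paired descriptors.
        dims = [d for d in (1, 2, 3) if d != self.axis]
        avg = x.mean(dim=dims, keepdim=True)
        mx = x.amax(dim=dims, keepdim=True)
        desc = torch.cat([avg, mx], dim=1).flatten(2)  # (B, 2, F) or (B, 2, T)
        mask = torch.sigmoid(self.conv(desc))          # (B, 1, F) or (B, 1, T)
        shape = [x.size(0), 1, 1, 1]
        shape[self.axis] = x.size(self.axis)
        return x * mask.view(*shape)                   # broadcast the gate

# Usage: gate a feature map along frequency, then along time.
feats = torch.randn(8, 64, 40, 200)  # (batch, channels, freq bins, frames)
out = AxisAttention(axis=3)(AxisAttention(axis=2)(feats))
print(out.shape)  # torch.Size([8, 64, 40, 200])
```

Applying the two gates independently, rather than a single 2-D attention map, lets the network weight frequency bands and time frames on their own terms, matching the abstract's goal of modelling temporal and frequency information independently.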

Cited by 55 publications (34 citation statements). References 21 publications.
“…In [7], a thin ResNet-34 architecture is used along with GhostVLAD pooling, resulting in 3.2% EER. In [8], a convolutional attention model is proposed for the time and frequency dimensions, GhostVLAD-based aggregation is applied, and the model achieves 2.0% EER. The lowest EER on the VoxCeleb1 test set is reported in [13], where their best system makes use of data augmentation and system combination.…”
Section: Related Work
confidence: 99%
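
The EER figures quoted in these excerpts are equal error rates: the verification operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of the computation, assuming scikit-learn; the toy labels and scores are illustrative, not from the cited systems.

```python
# Illustrative EER computation for speaker verification scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: point on the ROC where false-accept rate == false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = same-speaker trial
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # closest crossing of the two rates
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])        # toy trial labels
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.4])
print(f"EER = {equal_error_rate(labels, scores):.1%}")
```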
“…In Table 1, we present the EERs on the VC1 and VC2 datasets. The upper part of the table includes the audio-only (A-only) EER of [8, 13] and the attention-based fusion proposed in [3]. In the lower part of the table, the first two rows show the unimodal performance of our systems.…”
Section: Experiments on Unimodal and Multimodal Models
confidence: 99%
“…There are also embedding systems that use ResNets [10] or DenseNets [11] for frame-level processing. For example, lightweight ResNets were adapted from ResNet-34 and ResNet-50 in [12] and [13]. In [5], a deep embedding network was implemented based on DenseNets.…”
Section: Introduction
confidence: 99%