ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054440

Frequency and Temporal Convolutional Attention for Text-Independent Speaker Recognition

Abstract: The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogra…
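
The attention described in the abstract gates a CNN feature map separately along the frequency and time axes. Below is a minimal PyTorch sketch in the spirit of CBAM's spatial-attention branch; it is not the authors' implementation, and the module name, kernel size, and (batch, channel, frequency, time) layout are assumptions for illustration.

```python
# Illustrative sketch only: axis-wise (frequency or temporal) attention
# in the spirit of CBAM's spatial attention branch. Not the paper's code.
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Pool over all dims except `axis`, then gate the input with a sigmoid mask."""
    def __init__(self, axis: int, kernel_size: int = 7):
        super().__init__()
        self.axis = axis  # 2 -> frequency axis, 3 -> time axis (assumed layout)
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Average- and max-pool over every dimension except batch and `axis`,
        # mirroring CBAM's paired descriptors.
        dims = [d for d in (1, 2, 3) if d != self.axis]
        avg = x.mean(dim=dims, keepdim=True)
        mx = x.amax(dim=dims, keepdim=True)
        desc = torch.cat([avg, mx], dim=1).flatten(2)  # (B, 2, F) or (B, 2, T)
        mask = torch.sigmoid(self.conv(desc))          # (B, 1, F) or (B, 1, T)
        shape = [x.size(0), 1, 1, 1]
        shape[self.axis] = x.size(self.axis)
        return x * mask.view(*shape)                   # broadcast the gate

# Usage: gate a feature map along frequency, then along time.
feats = torch.randn(8, 64, 40, 200)  # (batch, channels, freq bins, frames)
out = AxisAttention(axis=3)(AxisAttention(axis=2)(feats))
print(out.shape)  # torch.Size([8, 64, 40, 200])
```

Applying the two gates independently, rather than a single 2-D attention map, lets the network weight frequency bands and time frames on their own terms, matching the abstract's goal of modelling temporal and frequency information independently.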

Cited by 55 publications (34 citation statements). References 21 publications.
“…In [7], a thin ResNet-34 architecture is used along with GhostVLAD pooling, resulting in 3.2% EER. In [8], a convolutional attention model is proposed for the time and frequency dimensions, GhostVLAD-based aggregation is applied, and the model achieves 2.0% EER. The lowest EER on the VoxCeleb1 test set is reported in [13], where their best system makes use of data augmentation and system combination.…”
Section: Related Work
confidence: 99%
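
The EER figures quoted in these excerpts are equal error rates: the verification operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of the computation, assuming scikit-learn; the toy labels and scores are illustrative, not from the cited systems.

```python
# Illustrative EER computation for speaker verification scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: point on the ROC where false-accept rate == false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = same-speaker trial
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # closest crossing of the two rates
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])        # toy trial labels
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.4])
print(f"EER = {equal_error_rate(labels, scores):.1%}")
```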
“…In Table 1, we present the EERs on the VC1 and VC2 datasets. The upper part of the table includes the audio-only (A-only) EER of [8, 13] and the attention-based fusion proposed in [3]. In the lower part of the table, the first two rows show the unimodal performance of our systems.…”
Section: Experiments on Unimodal and Multimodal Models
confidence: 99%
“…There are also embedding systems that use ResNets [10] or DenseNets [11] for frame-level processing. For example, lightweight ResNets were adapted from ResNet-34 and ResNet-50 in [12] and [13]. In [5], a deep embedding network was implemented based on DenseNets.…”
Section: Introduction
confidence: 99%