The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate the frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose convolutional attention methods for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogram inputs. The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb [2, 3] speaker verification benchmark. Our best model achieves an equal error rate of 2.031% on the VoxCeleb1 test set, a considerable improvement over comparable state-of-the-art results. To assess the effects of frequency and temporal attention more thoroughly under real-world conditions, we conduct ablation experiments in which frequency bins and temporal frames are randomly dropped from the input spectrograms, concluding that modelling temporal and frequency attention simultaneously, rather than either one alone, translates to better real-world performance.
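The abstract above describes applying CBAM-style attention separately along the temporal and frequency axes of a spectrogram. The paper itself defines the exact module; as a rough illustration only, the sketch below shows the general pooling-and-gating pattern: pool the spectrogram along all other axes, combine the average- and max-pooled descriptors, and use a sigmoid gate to reweight the chosen axis. The function name `axis_attention`, the use of a plain sum in place of a learned MLP, and the toy spectrogram shape are all illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def axis_attention(spec, axis):
    """CBAM-style attention over one spectrogram axis (illustrative).

    Pools the spectrogram along every axis except `axis`, combines the
    average- and max-pooled descriptors, and squashes them into per-bin
    attention weights. A real module would pass the pooled descriptors
    through a small learned MLP before the sigmoid; here a plain sum
    stands in for that learned transform.
    """
    other = tuple(a for a in range(spec.ndim) if a != axis)
    avg = spec.mean(axis=other)              # average-pooled descriptor
    mx = spec.max(axis=other)                # max-pooled descriptor
    weights = sigmoid(avg + mx)              # shape: (spec.shape[axis],)
    shape = [1] * spec.ndim
    shape[axis] = spec.shape[axis]
    return spec * weights.reshape(shape)     # broadcast weights over the axis

# Toy spectrogram: (frequency bins, time frames)
spec = np.random.default_rng(0).standard_normal((64, 200))
freq_attended = axis_attention(spec, axis=0)  # frequency attention
time_attended = axis_attention(spec, axis=1)  # temporal attention
print(freq_attended.shape, time_attended.shape)  # (64, 200) (64, 200)
```

Gating each axis independently, rather than producing a single 2-D spatial map, is what lets the two attention branches specialize in frequency versus temporal structure.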
The success of any text-independent speaker identification and/or verification system relies on its capability to learn discriminative features. In this paper, we propose a convolutional neural network (CNN) architecture based on the popular very deep VGG [1] CNNs, with key modifications to accommodate variable-length spectrogram inputs, reduce model disk-space requirements, and reduce the number of parameters, resulting in significantly shorter training times. We also propose a unified deep learning system for both text-independent speaker identification and speaker verification by training the proposed network architecture under the joint supervision of softmax loss and center loss [2], yielding highly discriminative deep features suited to both tasks. We benchmark our approach on the recently released VoxCeleb dataset [3], which contains hundreds of thousands of real-world utterances from over 1,200 celebrities of various ethnicities. Our best CNN model achieved a Top-1 accuracy of 84.6%, a 4% absolute improvement over VoxCeleb's approach, whereas training in conjunction with center loss improved the Top-1 accuracy to 89.5%, a 9% absolute improvement over VoxCeleb's approach.
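The joint supervision described above combines a standard softmax cross-entropy term with the center loss of Wen et al. [2], which penalizes the distance between each embedding and its class center. A minimal NumPy sketch of that combined objective follows; the function names, the fixed random centers (which would normally be learned and updated alongside the network), and the weighting value `lam` are illustrative assumptions, not the paper's exact training setup.

```python
import numpy as np

def softmax_ce(logits, labels):
    # Numerically stable softmax cross-entropy over class logits.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def center_loss(feats, labels, centers):
    # Half the mean squared distance of each embedding to its class center,
    # as in Wen et al. [2]; pulls same-class embeddings together.
    diff = feats - centers[labels]
    return 0.5 * (diff ** 2).sum(axis=1).mean()

def joint_loss(logits, feats, labels, centers, lam=0.003):
    # Total objective: separability (softmax) + compactness (center loss).
    # `lam` is a hypothetical weighting, not the paper's reported value.
    return softmax_ce(logits, labels) + lam * center_loss(feats, labels, centers)

# Toy batch: 8 embeddings of dimension 16 over 5 speaker classes.
rng = np.random.default_rng(0)
n_classes, dim = 5, 16
feats = rng.standard_normal((8, dim))
logits = rng.standard_normal((8, n_classes))
labels = rng.integers(0, n_classes, size=8)
centers = rng.standard_normal((n_classes, dim))  # learned in practice
loss = joint_loss(logits, feats, labels, centers)
print(float(loss) > 0.0)
```

The softmax term drives inter-class separability while the center term enforces intra-class compactness, which is why features trained this way transfer from closed-set identification to open-set verification.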