Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single-active-speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system called unmixing, fixed-beamformer and extraction (UFE), which was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation through the entire pipeline, an attentional selection module is proposed, in which an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves performance comparable to the original separate-processing pipeline in an offline evaluation, while producing remarkable improvements in an online evaluation.
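A minimal sketch of the kind of attentional selection described above, assuming a PyTorch implementation with invented layer sizes and names: a softmax weight is learnt over the fixed-beamformer outputs so that the selection step becomes a differentiable soft combination, allowing gradients to flow through the whole pipeline. This is an illustration only, not the authors' exact module.

```python
import torch
import torch.nn as nn

class AttentionalSelection(nn.Module):
    """Soft selection over fixed-beamformer outputs (illustrative sketch)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # Produces one score per beamformer direction; sizes are assumptions.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, beam_feats: torch.Tensor) -> torch.Tensor:
        # beam_feats: (batch, num_beams, feat_dim), features from the fixed beamformers.
        logits = self.scorer(beam_feats).squeeze(-1)   # (batch, num_beams)
        weights = torch.softmax(logits, dim=-1)        # attentional weights over space
        # Weighted sum instead of a hard arg-max pick keeps the module differentiable.
        return torch.einsum("bn,bnd->bd", weights, beam_feats)

# Example usage: 8 fixed beams, 256-dim per-beam features.
module = AttentionalSelection(feat_dim=256)
selected = module(torch.randn(4, 8, 256))  # -> shape (4, 256)
```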
This paper proposes and investigates several deep neural network (DNN)-based score compensation, transformation, and calibration algorithms for enhancing the noise robustness of i-vector speaker verification systems. Unlike conventional calibration methods, in which the required score shift is a linear function of SNR or log-duration, the DNN approach learns the complex relationship between the score shifts and the combination of i-vector pairs and uncalibrated scores. Furthermore, with the flexibility of DNNs, it is possible to explicitly train a DNN to recover the clean scores without having to estimate the score shifts. To alleviate the overfitting problem, multi-task learning is applied to incorporate auxiliary information, such as the SNR and speaker ID of the training utterances, into the DNN. Experiments on NIST 2012 SRE show that score calibration derived from multi-task DNNs can significantly improve on the conventional score-shift approach, especially under noisy conditions.
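A hedged sketch of the kind of multi-task calibration DNN described here, assuming PyTorch; the layer sizes, dimensions, and head names are illustrative assumptions, not the authors' exact architecture. The network maps an i-vector pair plus the uncalibrated score to a clean score, with auxiliary SNR and speaker-ID heads used only as extra training targets to reduce overfitting.

```python
import torch
import torch.nn as nn

class CalibrationDNN(nn.Module):
    """Multi-task score-calibration DNN (illustrative sketch)."""
    def __init__(self, ivec_dim=500, num_speakers=300, hidden=512):
        super().__init__()
        in_dim = 2 * ivec_dim + 1  # enrolment i-vector, test i-vector, uncalibrated score
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.score_head = nn.Linear(hidden, 1)            # main task: clean score
        self.snr_head = nn.Linear(hidden, 1)              # auxiliary: SNR regression
        self.spk_head = nn.Linear(hidden, num_speakers)   # auxiliary: speaker ID

    def forward(self, enrol_ivec, test_ivec, raw_score):
        # enrol_ivec, test_ivec: (batch, ivec_dim); raw_score: (batch, 1)
        h = self.backbone(torch.cat([enrol_ivec, test_ivec, raw_score], dim=-1))
        return self.score_head(h), self.snr_head(h), self.spk_head(h)
```

At test time only the score head would be used; the auxiliary heads exist solely to shape the shared representation during training.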
This paper proposes applying multi-task learning to train deep neural networks (DNNs) for calibrating the PLDA scores of speaker verification systems under noisy environments. To help the DNNs learn the main task (calibration), several auxiliary tasks were introduced, including predicting the SNR and duration from i-vectors and classifying whether an i-vector pair belongs to the same speaker. The possibility of replacing the PLDA model with a DNN during the scoring stage is also explored. Evaluations on noise-contaminated speech suggest that the auxiliary tasks are important for the DNNs to learn the main calibration task and that the uncalibrated PLDA scores are an essential input to the DNNs. Without this input, the DNNs cannot predict the score shifts accurately, suggesting that the PLDA model is indispensable.
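One way the multi-task objective could be combined during training is sketched below, assuming mean-squared error for the regression targets and a binary cross-entropy loss for the same/different-speaker task, with a hand-tuned auxiliary weight; the paper's actual losses and weights may differ.

```python
import torch.nn.functional as F

def multitask_loss(pred_score, clean_score,
                   pred_snr, true_snr,
                   pred_dur, true_dur,
                   same_spk_logits, same_spk_label,
                   aux_weight=0.1):
    # Main task: regress the calibrated (clean) score.
    main = F.mse_loss(pred_score, clean_score)
    # Auxiliary tasks: SNR and duration regression, same-speaker classification
    # (same_spk_label is a float tensor of 0/1 targets).
    aux = (F.mse_loss(pred_snr, true_snr)
           + F.mse_loss(pred_dur, true_dur)
           + F.binary_cross_entropy_with_logits(same_spk_logits, same_spk_label))
    return main + aux_weight * aux
```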