Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Tang, Yun; Ding, Guohong; Huang, Jing; He, Xiaodong; Zhou, Bowen

doi:10.1109/icassp.2019.8682712

Cited by 71 publications

(41 citation statements)

References 20 publications

“…Further experiments on WSJ and LibriSpeech show that our attention mechanism could achieve the best performance among all end-to-end methods without data augmentation, and it is only slightly worse than the state-of-the-art performance. In the future, we would study other ways of utilizing multi-level information [43].…”

Section: Results (mentioning)

confidence: 99%

A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

Qin

Zhang

2019

J AUDIO SPEECH MUSIC PROC.

A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model could force extra restrictions on alignments. To explore better the end-to-end models, we propose improvements to the feature extraction and attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) that is only 0.2% worse in absolute value than the best referenced method, which is trained on a much larger dataset, and it beats all present end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.

“…Since the time delay neural network (TDNN) based x-vector system was proposed [32], a series of similar improved networks have been investigated [18,19,25], and a variety of other neural network architectures, such as DenseNet, ResNet and InceptionNet, have also been used to extract more speaker-discriminative embeddings [21,23,24]. In order to obtain long-term speaker representation with more discriminative power, attention mechanism [17] is widely used recently.…”

Section: Introduction (mentioning)

confidence: 99%

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Guo

Dai

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper presents an improved deep embedding learning method based on convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) Multiscale convolution (MSCNN) is adopted in frame-level layers to capture complementary speaker information in different receptive fields. (2) A Baum-Welch statistics attention (BWSA) mechanism is applied in pooling-layer, which can integrate more useful long-term speaker characteristics in the temporal pooling layer. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show the proposed BWSA can further improve the performance of the DNN embedding system.

“…On the other hand, during testing, if not enough speech segments are detected, then the SV algorithms will not be able to detect the speaker. For this reason, the VAD has played a vital role in robust SV systems from traditional Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector systems [1,2] to recent deep speaker embedding systems [3,4,5,6].…”

Section: Introduction (mentioning)

confidence: 99%

Self-Adaptive Soft Voice Activity Detection Using Deep Neural Networks for Robust Speaker Verification

Jung

Choi

Kim

2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method is a combination of the following two approaches. The first approach is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor. The frame-level features are weighted by their corresponding speech posteriors estimated from the DNN-based VAD, and then aggregated to generate a speaker embedding. The second approach is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes, namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA). Experiments on a Korean speech database demonstrate that the verification performance is improved significantly in real-world environments by using self-adaptive soft VAD.
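
The soft selection described in this abstract (frame-level features weighted by speech posteriors, then aggregated) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `soft_vad_pool`, the list-based inputs, and the choice to aggregate with a weighted mean normalised by the sum of posteriors are all assumptions made for illustration, since the abstract does not specify the aggregation formula.

```python
from typing import List, Sequence


def soft_vad_pool(
    frames: Sequence[Sequence[float]],
    speech_posteriors: Sequence[float],
) -> List[float]:
    """Aggregate frame-level features into one vector, weighting each
    frame by its speech posterior (hypothetical helper, not from the paper).

    Assumption: aggregation is a weighted mean, i.e. the weighted sum of
    frames divided by the sum of the posteriors.
    """
    if len(frames) != len(speech_posteriors):
        raise ValueError("need exactly one posterior per frame")
    if not frames:
        raise ValueError("need at least one frame")

    total_weight = sum(speech_posteriors)
    if total_weight <= 0.0:
        raise ValueError("posteriors sum to zero: no speech detected")

    dim = len(frames[0])
    pooled = [0.0] * dim
    for frame, weight in zip(frames, speech_posteriors):
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimension")
        # A frame with posterior near 0 (likely non-speech) contributes
        # almost nothing; a frame with posterior near 1 counts fully.
        for i, value in enumerate(frame):
            pooled[i] += weight * value

    return [p / total_weight for p in pooled]


if __name__ == "__main__":
    # Three 2-dimensional frames; the third is treated as non-speech.
    features = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    posteriors = [1.0, 1.0, 0.0]
    print(soft_vad_pool(features, posteriors))
```

With posteriors restricted to 0 or 1 this reduces to hard VAD followed by average pooling, which is one way to see soft VAD as a relaxation of the usual frame-dropping step.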
