ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683828
Training Multi-task Adversarial Network for Extracting Noise-robust Speaker Embedding

Abstract: Achieving robust speaker recognition performance in noisy environments is still a challenging task. Motivated by the promising performance of multi-task training in a variety of image processing tasks, we explore the potential of multi-task adversarial training for learning a noise-robust speaker embedding. In this paper, we present a novel framework that consists of three components: an encoder that extracts the noise-robust speaker embeddings; a classifier that classifies the speakers; a discrimin…
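The three components named in the abstract can be sketched structurally as follows. This is a toy illustration only — the function names and numbers are hypothetical stand-ins, not the paper's actual networks:

```python
# Structural sketch of the three components named in the abstract
# (hypothetical stand-in functions, not the paper's implementation).
def encoder(noisy_features):
    # Maps noisy input features to a fixed-size speaker embedding.
    return [sum(noisy_features) / len(noisy_features)]

def classifier(embedding):
    # Predicts a speaker identity from the embedding.
    return 0 if embedding[0] < 0.5 else 1

def discriminator(embedding):
    # Tries to tell noisy from clean conditions; trained adversarially
    # so the encoder learns to make this distinction hard.
    return 0 if embedding[0] < 0.5 else 1

embedding = encoder([0.2, 0.4, 0.6])
speaker = classifier(embedding)
```

The adversarial objective couples these pieces: the discriminator is optimized to detect the noise condition, while the encoder is optimized against it.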

Cited by 41 publications (35 citation statements)
References 23 publications (29 reference statements)
“…This is the definition of a path. Since p_{i,w_i} is the probability of outputting the w_i-th element of V ∪ {−} at time i, the probability of the path P can be calculated as Equation (19).…”
Section: CTC
confidence: 99%
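The path probability in the excerpt above — the product of the per-timestep output probabilities p_{i,w_i} — can be illustrated with a toy distribution. The vocabulary and probability values below are made up for illustration:

```python
# Sketch: probability of one CTC path as the product of per-timestep
# output probabilities p[t][w_t] (toy numbers, not real model outputs).
def path_probability(probs, path):
    """probs[t] maps each symbol (incl. the blank '-') to its
    probability at time t; path is the symbol chosen at each step."""
    p = 1.0
    for t, symbol in enumerate(path):
        p *= probs[t][symbol]
    return p

# Toy vocabulary V = {'a', 'b'} plus the blank symbol '-'.
probs = [
    {'a': 0.6, 'b': 0.3, '-': 0.1},
    {'a': 0.2, 'b': 0.7, '-': 0.1},
]
print(path_probability(probs, ['a', 'b']))  # 0.6 * 0.7
```

The CTC loss then sums this quantity over all paths that collapse to the target label sequence.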
“…However, at the same time, there are still many works that use the ReLU activation F(x) = max{x, 0} [7, 19, 22-24, 27, 28].…”
Section: Activations
confidence: 99%
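The ReLU activation in the excerpt is simply an elementwise max with zero; a minimal sketch:

```python
def relu(x):
    # F(x) = max{x, 0}: passes positive inputs, zeroes out negatives.
    return max(x, 0.0)

print([relu(v) for v in [-2.0, 0.0, 3.5]])  # [0.0, 0.0, 3.5]
```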
“…We use CTC loss to train the AM so that the network outputs align with the phoneme sequences automatically, and cross-entropy loss to discriminate between dialects. Compared with multi-task training [28,29] in SV tasks, it should be emphasized that these stages are trained step by step rather than as multi-task learning with shared layers: we backpropagate through the whole network while training the AM, and only backpropagate through the RNN part in the second stage; otherwise the network degenerates and loses the acoustic knowledge.…”
Section: Loss Function
confidence: 99%
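The step-by-step scheme in the excerpt — backpropagating through the whole network in stage one, then only through the RNN part in stage two — amounts to selectively freezing parameters between stages. A toy numeric sketch (hypothetical names and gradient values, not the cited work's code):

```python
# Two-stage training sketch: stage 1 updates everything (CTC loss on
# the AM); stage 2 freezes the AM layers and updates only the RNN part
# (cross-entropy dialect loss), preserving the acoustic knowledge.
class ToyNetwork:
    def __init__(self):
        self.am_weight = 1.0   # acoustic-model parameter
        self.rnn_weight = 1.0  # RNN-part parameter

    def update(self, grad_am, grad_rnn, lr=0.1, freeze_am=False):
        if not freeze_am:
            self.am_weight -= lr * grad_am
        self.rnn_weight -= lr * grad_rnn

net = ToyNetwork()
net.update(grad_am=0.5, grad_rnn=0.5)                  # stage 1: whole network
net.update(grad_am=0.5, grad_rnn=0.5, freeze_am=True)  # stage 2: RNN part only
```

After stage two, the AM parameter keeps its stage-one value while the RNN parameter has taken both updates — the freezing is what prevents the degeneration the excerpt warns about.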
“…Within this framework, there are two main methods. The first regards the noisy data as a domain different from the clean data and applies adversarial training to handle the domain mismatch and obtain a noise-invariant speaker embedding [14,15]. The second method employs a DNN speech enhancement network for ASV tasks.…”
Section: Introduction
confidence: 99%
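Adversarial domain training of the kind mentioned in [14,15] is commonly implemented with a gradient reversal layer — an assumption here, since the excerpt does not name the mechanism. The discriminator learns to detect the noise domain, while the reversed gradient pushes the encoder toward noise-invariant embeddings. A minimal numeric sketch with made-up gradient values:

```python
# Gradient-reversal sketch (hypothetical scalar gradients, not a real
# training loop). The forward pass is identity; on the backward pass,
# the gradient flowing from the noise discriminator into the encoder
# is negated, so the encoder is trained to *hinder* noise detection.
def grad_reverse(grad, lam=1.0):
    return -lam * grad

speaker_grad = 0.3     # gradient from the speaker-classification loss
noise_disc_grad = 0.2  # gradient from the noise-discriminator loss
encoder_grad = speaker_grad + grad_reverse(noise_disc_grad)
```

The scale factor `lam` (a common but here assumed hyperparameter) trades off speaker discriminability against noise invariance.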