2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462026

Insights into End-to-End Learning Scheme for Language Identification

Abstract: A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of the front-end CNN, so that it can encode the variable-length input sequence into an utterance-level vector automatically. After comparing with the state-of-the-art GMM i-vector methods, we give insights into the CNN, and reveal its role and effect in the whole pipel…
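
The core of the pipeline described in the abstract is an encoding layer that turns a variable-length CNN feature sequence into a fixed utterance-level vector. As a rough illustration of that idea (not the authors' implementation; the temporal average pooling choice, all layer sizes, and the class count are assumptions), here is a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class TAPEncoder(nn.Module):
    """Toy front-end CNN plus a temporal average pooling (TAP) encoding
    layer. Illustrative only: sizes are arbitrary, not the paper's config."""
    def __init__(self, n_mels=64, emb_dim=128, n_langs=10):
        super().__init__()
        # Front-end CNN over (batch, 1, n_mels, frames)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        self.proj = nn.Linear(64, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_langs)

    def forward(self, x):
        h = self.cnn(x)                   # (B, 64, 1, T)
        h = h.squeeze(2).transpose(1, 2)  # (B, T, 64) frame-level features
        v = h.mean(dim=1)                 # TAP: average over time -> (B, 64)
        e = self.proj(v)                  # fixed-size utterance-level vector
        return self.classifier(e)         # language logits
```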

Cited by 22 publications (21 citation statements: 0 supporting, 21 mentioning, 0 contrasting; years 2018–2024). References 28 publications.
“…Therefore, a flexible processing method should have the ability to accept speech segments of arbitrary duration. Motivated by [21, 22, 24], the whole end-to-end framework in this paper is shown in Fig. 3.…”
Section: End-to-End System Overview (mentioning)
Confidence: 99%
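
To make the arbitrary-duration point in the statement above concrete, the short continuation below reuses the hypothetical TAPEncoder from the earlier sketch: two inputs with different frame counts pass through the same network and yield outputs of identical shape. Frame counts and the 10 ms shift are illustrative.

```python
import torch  # continues the hypothetical TAPEncoder sketch above

model = TAPEncoder()
short = torch.randn(1, 1, 64, 200)    # ~2 s of frames at a 10 ms shift
long_ = torch.randn(1, 1, 64, 1000)   # ~10 s
# Pooling over time makes the output shape duration-independent.
print(model(short).shape)  # torch.Size([1, 10])
print(model(long_).shape)  # torch.Size([1, 10])
```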
“…Short-time Cepstral Mean Subtraction (CMS) is applied with a 3 s sliding window. For the end-to-end network, they use a residual network (ResNet) system with a global statistics pooling layer and a fully connected layer, and each output node represents a target dialect class [25]. The model was trained with the standard cross-entropy loss over a softmax layer.…”
Section: ADI (mentioning)
Confidence: 99%
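
The global statistics pooling described in this statement concatenates the per-utterance mean and standard deviation of the frame-level features before the fully connected classifier, and training uses softmax cross-entropy. A minimal sketch of such a head (the feature dimension, batch size, and 8-class output are assumptions, and the ResNet front end is replaced by random features):

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Global statistics pooling: concatenate the mean and standard
    deviation of frame-level features over time."""
    def forward(self, h):  # h: (batch, frames, dim)
        return torch.cat([h.mean(dim=1), h.std(dim=1)], dim=1)  # (batch, 2*dim)

# Hypothetical head on top of ResNet frame-level features
# (dim=256 and 8 dialect classes are assumptions, not from the paper).
pool = StatsPooling()
fc = nn.Linear(2 * 256, 8)
criterion = nn.CrossEntropyLoss()   # softmax + cross-entropy in one call

frames = torch.randn(4, 300, 256)   # stand-in for ResNet outputs
labels = torch.randint(0, 8, (4,))
loss = criterion(fc(pool(frames)), labels)
loss.backward()
```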
“…2. It is surprising that although LSTM or GRU layers introduce many more parameters than the TAP layer, they result in degraded performance, especially for the testing task over long-range durations [10].…”
Section: Introduction (mentioning)
Confidence: 99%
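
The parameter contrast behind this observation is easy to verify: temporal average pooling (TAP) has no learnable weights, while a recurrent pooler of the same width carries several weight matrices. A small sketch under assumed dimensions:

```python
import torch
import torch.nn as nn

class TAP(nn.Module):
    """Temporal average pooling: a plain mean over time, no parameters."""
    def forward(self, h):              # h: (batch, frames, dim)
        return h.mean(dim=1)

class GRUPooler(nn.Module):
    """Pool by keeping the final hidden state of a GRU over the frames."""
    def __init__(self, dim=256):       # dim=256 is an assumed width
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
    def forward(self, h):
        _, last = self.gru(h)          # last: (1, batch, dim)
        return last.squeeze(0)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(TAP()))        # 0
print(n_params(GRUPooler()))  # 394752 for dim=256
```

For a 256-dimensional hidden size the GRU pooler alone adds roughly 395k parameters, which is the scale of gap the quoted statement refers to.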