2017
DOI: 10.1016/j.csl.2017.01.006
Regularization of neural network model with distance metric learning for i-vector based spoken language identification

Cited by 16 publications (7 citation statements) · References 12 publications
“…In this paper, we also take the DCNN as our baseline modeling architecture. As explained in the introduction, although the DCNN model for classification tries to learn the input-target mapping function, we can regard the process as two coupled functions, feature extraction and classifier modeling, as we did before [21]. The coupling network and classification score calculation are illustrated in Fig.…”
Section: Deep Convolutional Neural Network for AED
confidence: 99%
“…For example, in a large category of machine learning, feature learning takes into account intra- and inter-class pair-wise distance measurements [14,15,16]. In the DL framework, nonlinear distance metric learning has been proposed for different applications [17,18,19,20,21]; these all share a similar idea of feature extraction with pair-wise Siamese network models, as originally proposed in [23,24,19]. As a further generalization of the pair-wise Siamese network idea for feature extraction, the triplet loss was proposed [22].…”
Section: Introduction
confidence: 99%
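The pair-wise (Siamese) and triplet objectives mentioned in this excerpt can be sketched as simple loss functions. This is a generic NumPy illustration, not the exact formulation used in the cited works; the margin values are arbitrary:

```python
import numpy as np

def contrastive_loss(x1, x2, same, margin=1.0):
    """Pair-wise (Siamese) loss: pull same-class embeddings together,
    push different-class embeddings at least `margin` apart."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet generalization: the anchor should be closer to the
    positive than to the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

In both cases the loss shapes the embedding space directly, which is why these objectives are grouped under distance metric learning rather than classification.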
“…The i-vectors were 400-dimensional vectors obtained from the full-length utterances (average duration 7.6 s) with the Kaldi toolkit scripts [17]. For the SVM classifier, we used the radial basis function (RBF) kernel and a grid search with cross-validation, following [18]. The DNN model had two hidden layers with 512 neurons each, and a dropout of 0.3 was applied.…”
Section: Implementation of Baseline Systems
confidence: 99%
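The dropout rate of 0.3 mentioned for the DNN baseline can be sketched as inverted dropout, the common variant that rescales surviving activations at training time. A minimal NumPy illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.3, training=True):
    """Inverted dropout: zero a fraction `rate` of units during training
    and rescale the survivors so the expected activation is unchanged.
    At inference time the layer is an identity."""
    if not training or rate == 0.0:
        return activations
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep
```

Because the rescaling happens at training time, no correction is needed at test time, which keeps inference identical to a network trained without dropout.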
“…The generated data were created by adding random uniformly distributed noise over the interval [-1, 1] to the original i-vector data. We also compared the proposed method with conventional methods, i.e., cosine distance and support vector machines (SVMs) with linear and radial basis function (RBF) kernels [19]. The optimal SVM model parameters were obtained by a grid search with cross-validation.…”
Section: Implementation of Baseline Systems
confidence: 99%
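The uniform-noise augmentation and the cosine-distance baseline described in this excerpt can be sketched as follows. This is a minimal illustration; the function names are my own, not from the paper:

```python
import numpy as np

def augment(ivectors, rng):
    """Generate extra training data by adding noise drawn uniformly
    from [-1, 1] to each original i-vector."""
    noise = rng.uniform(-1.0, 1.0, size=ivectors.shape)
    return ivectors + noise

def cosine_score(x, y):
    """Cosine-similarity baseline: score a test i-vector against a
    reference (e.g., language-average) i-vector."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Usage: calling `augment` once per epoch on the training i-vectors yields fresh noisy copies, effectively enlarging a limited training set.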
“…To improve the generalization of the model, regularization methods such as weight decay [16], dropout [17], and data augmentation have been proposed. For i-vector-based LID tasks, previous works have already investigated DNNs with dropout [15] and distance metric learning [18,19] with limited training data.…”
Section: Introduction
confidence: 99%
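Weight decay, cited here as a regularizer, amounts to an L2 penalty folded into the parameter update. A minimal SGD sketch (generic, not tied to the cited paper's setup):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01, weight_decay=1e-4):
    """One SGD update with L2 weight decay: the decay term shrinks the
    weights toward zero on top of the data-gradient step, discouraging
    large weights and thus overfitting on small training sets."""
    return w - lr * (grad + weight_decay * w)
```

With a zero data gradient, each step multiplies the weights by `(1 - lr * weight_decay)`, which is the "decay" the name refers to.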