Automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. Conventional approaches to LID (and to speaker recognition) ignore this sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose a neural network framework that utilizes short-sequence information for language recognition. In particular, a new model is proposed for incorporating relevance into language recognition, where parts of the speech data are weighted more heavily based on their relevance to the language recognition task. This relevance weighting is achieved using a bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches: the first aggregates segment-level i-vector/x-vector representations in the neural model, while the second models the acoustic features directly in an end-to-end neural network. Experiments are performed on the language recognition task of the NIST LRE 2017 challenge using clean, noisy, and multi-speaker speech data, as well as on the RATS language recognition corpus. On the noisy LRE tasks as well as the RATS dataset, the proposed approach yields significant improvements over conventional i-vector/x-vector based language recognition approaches as well as over other previous models incorporating sequence information.
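The relevance-weighting idea in the abstract can be illustrated with a minimal NumPy sketch of attention pooling over segment embeddings. This is not the paper's implementation: a single attention vector `w` stands in for the BLSTM-with-attention layer, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(segments, w):
    """Relevance-weighted pooling of per-segment embeddings.

    segments: (T, D) array of segment-level i-vector/x-vector embeddings
    w: (D,) attention parameter vector (illustrative stand-in for the
       BLSTM + attention scorer in the paper)
    Returns a single (D,) utterance embedding in which segments deemed
    more relevant to the LID task receive larger weights.
    """
    scores = segments @ w        # (T,) unnormalized relevance scores
    alphas = softmax(scores)     # (T,) relevance weights, sum to 1
    return alphas @ segments     # (D,) weighted average of segments

rng = np.random.default_rng(0)
segs = rng.normal(size=(5, 8))   # 5 speech segments, 8-dim embeddings
w = rng.normal(size=8)
utt = attention_pool(segs, w)
print(utt.shape)                 # (8,)
```

In contrast to a plain average over segments, the softmax weights let the model down-weight noisy or uninformative portions of the recording, which is the motivation for relevance weighting in the paper.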
The language recognition evaluation (LRE) 2017 challenge comprises an open evaluation of the language identification (LID) task on a set of 14 languages/dialects. In this paper, we describe our submission to the LRE 2017 challenge fixed condition, which consisted of developing various LID systems based on i-vector modeling. Front-end processing is performed using deep neural network (DNN) bottleneck features for i-vector modeling with a Gaussian mixture model (GMM) universal background model (UBM). Several back-end systems consisting of support vector machines (SVMs) and DNN models were used for language/dialect classification. The submission achieved significant improvements over the evaluation baseline provided by NIST (relative improvements of more than 50% over the baseline). In the latter part of the paper, we detail our post-evaluation efforts to improve the language recognition system for short-duration speech data using novel approaches to sequence modeling of segment i-vectors. These post-evaluation efforts yielded further improvements over the submitted system (relative improvements of about 22%). An error analysis is also presented, highlighting the confusions and errors of the final system.
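A common way to score utterance i-vectors against the 14 target languages/dialects is cosine scoring against per-language mean i-vectors. The sketch below is a hedged illustration of that generic back-end, not the SVM/DNN back-ends described above; the 400-dim i-vectors and the synthetic data are assumptions for the example.

```python
import numpy as np

def length_norm(x):
    """Project vectors onto the unit sphere (length normalization)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_scores(ivector, language_means):
    """Cosine similarity of one test i-vector to each language mean.

    ivector: (D,) test utterance i-vector
    language_means: (L, D) mean i-vector per target language/dialect
    Returns (L,) scores; the argmax is the hypothesized language.
    """
    return length_norm(language_means) @ length_norm(ivector)

rng = np.random.default_rng(1)
means = rng.normal(size=(14, 400))               # 14 LRE 2017 targets
test_iv = means[3] + 0.1 * rng.normal(size=400)  # noisy copy of language 3
scores = cosine_scores(test_iv, means)
print(int(scores.argmax()))                      # 3
```

Length normalization makes the dot product equal to the cosine similarity, so the scores are comparable across languages regardless of i-vector magnitude.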