2017
DOI: 10.1007/978-3-319-70136-3_93

Language Identification Using Deep Convolutional Recurrent Neural Networks

Abstract: Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. …
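The abstract describes treating language identification as an image-classification problem. As a rough illustration only (not taken from the paper), the sketch below shows one common way to turn an audio clip into a log-mel spectrogram "image" that a CNN could consume; the file name, sample rate, clip duration, and mel-band count are assumptions.

```python
import numpy as np
import librosa

def audio_to_spectrogram(path, sr=16000, n_mels=128, duration=10.0):
    """Load an audio clip and return a normalised log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 1] so the array can be fed to a CNN like a grayscale image.
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)

# "utterance.wav" is a placeholder file name, not data from the paper.
spectrogram = audio_to_spectrogram("utterance.wav")
```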

Cited by 82 publications (69 citation statements) · References 15 publications
“…Audio files are a sequence of spoken words, hence they have temporal features too. A CNN is better at capturing spatial features only, and RNNs are better at capturing temporal features, as demonstrated by Bartz et al. [1] using audio files. Therefore, we combined both of these to make a CRNN model.…”
Section: Motivations (mentioning, confidence: 98%)
“…Our 2D CRNN architecture is shown in Figure 1. The architecture was inspired by Bartz et al. [26], who applied 2D CRNNs to spoken language identification. We applied a similar architecture for SAD.…”
Section: Network Description (mentioning, confidence: 99%)
“…This enables training a CNN even when the available training data is not as large as that required by other deep architectures. On the other hand, LSTM-RNN (Zazo, Lozano-Diez, Gonzalez-Dominguez, Toledano, & Gonzalez-Rodriguez; Zhang et al.) is another powerful tool, as it captures information through a sequence of time steps for language identification (Bartz, Herold, Yang, & Meinel). In the existing literature (Bartz et al.; Mounika et al.), different architectures, like DNN and hybrid convolutional RNN, are used to model the features extracted from the frames constituting the utterance.…”
Section: Introduction (mentioning, confidence: 99%)
“…On the other hand, LSTM-RNN (Zazo, Lozano-Diez, Gonzalez-Dominguez, Toledano, & Gonzalez-Rodriguez; Zhang et al.) is another powerful tool, as it captures information through a sequence of time steps for language identification (Bartz, Herold, Yang, & Meinel). In the existing literature (Bartz et al.; Mounika et al.), different architectures, like DNN and hybrid convolutional RNN, are used to model the features extracted from the frames constituting the utterance. However, in tonal languages, tonal events are prominent within a syllable (Atterer & Ladd) and, therefore, features should preferably be extracted syllable by syllable.…”
Section: Introduction (mentioning, confidence: 99%)
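The excerpts above describe LSTM-RNNs that accumulate evidence over a sequence of time steps (frames, or syllable-level segments in the case of tonal languages). A minimal sketch of such a sequence classifier, assuming generic per-frame acoustic features such as 39-dimensional MFCCs and an arbitrary number of target languages, might look like this:

```python
from tensorflow.keras import layers, models

def build_lstm_lid(feature_dim=39, n_languages=6):
    # Variable-length sequence of per-frame (or per-syllable) feature vectors.
    inputs = layers.Input(shape=(None, feature_dim))
    x = layers.Masking()(inputs)          # skip zero-padded time steps
    x = layers.LSTM(256, return_sequences=True)(x)
    x = layers.LSTM(256)(x)               # final state summarises the whole utterance
    outputs = layers.Dense(n_languages, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_lstm_lid()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```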