2017
DOI: 10.1007/978-3-319-70136-3_93

Language Identification Using Deep Convolutional Recurrent Neural Networks

Abstract: Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. …
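The abstract describes treating language identification as an image-classification problem. As a rough illustration only (not taken from the paper), the sketch below shows one common way to turn an audio clip into a log-mel spectrogram "image" that a CNN could consume; the file name, sample rate, clip duration, and mel-band count are assumptions.

```python
import numpy as np
import librosa

def audio_to_spectrogram(path, sr=16000, n_mels=128, duration=10.0):
    """Load an audio clip and return a normalised log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 1] so the array can be fed to a CNN like a grayscale image.
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)

# "utterance.wav" is a placeholder file name, not data from the paper.
spectrogram = audio_to_spectrogram("utterance.wav")
```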

Cited by 82 publications (69 citation statements) · References 15 publications
“…Audio files are a sequence of spoken words, hence they have temporal features too. A CNN is better at capturing spatial features only, and RNNs are better at capturing temporal features, as demonstrated by Bartz et al. [1] using audio files. Therefore, we combined both of these to make a CRNN model.…”
Section: Motivations (mentioning, confidence: 98%)
“…Our 2D CRNN architecture is shown in Figure 1. The architecture was inspired by Bartz et al. [26], who applied 2D CRNNs to spoken language identification. We applied a similar architecture for SAD.…”
Section: Network Description (mentioning, confidence: 99%)
“…This enables training a CNN even when the available training data is not as large as that required by other deep architectures. On the other hand, LSTM-RNN (Zazo, Lozano-Diez, Gonzalez-Dominguez, Toledano, & Gonzalez-Rodriguez; Zhang et al.) is another powerful tool, as it captures information through a sequence of time steps for language identification (Bartz, Herold, Yang, & Meinel). In the existing literature (Bartz et al.; Mounika et al.), different architectures, like DNN and hybrid convolutional RNN, are used to model the features extracted from the frames constituting the utterance.…”
Section: Introduction (mentioning, confidence: 99%)
“…On the other hand, LSTM-RNN (Zazo, Lozano-Diez, Gonzalez-Dominguez, Toledano, & Gonzalez-Rodriguez; Zhang et al.) is another powerful tool, as it captures information through a sequence of time steps for language identification (Bartz, Herold, Yang, & Meinel). In the existing literature (Bartz et al.; Mounika et al.), different architectures, like DNN and hybrid convolutional RNN, are used to model the features extracted from the frames constituting the utterance. However, in tonal languages, tonal events are prominent within a syllable (Atterer & Ladd) and, therefore, features should preferably be extracted syllable by syllable.…”
Section: Introduction (mentioning, confidence: 99%)
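The excerpts above describe LSTM-RNNs that accumulate evidence over a sequence of time steps (frames, or syllable-level segments in the case of tonal languages). A minimal sketch of such a sequence classifier, assuming generic per-frame acoustic features such as 39-dimensional MFCCs and an arbitrary number of target languages, might look like this:

```python
from tensorflow.keras import layers, models

def build_lstm_lid(feature_dim=39, n_languages=6):
    # Variable-length sequence of per-frame (or per-syllable) feature vectors.
    inputs = layers.Input(shape=(None, feature_dim))
    x = layers.Masking()(inputs)          # skip zero-padded time steps
    x = layers.LSTM(256, return_sequences=True)(x)
    x = layers.LSTM(256)(x)               # final state summarises the whole utterance
    outputs = layers.Dense(n_languages, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_lstm_lid()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```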