Interspeech 2019
DOI: 10.21437/interspeech.2019-1522

Two-Stage Training for Chinese Dialect Recognition

Abstract: In this paper, we present a two-stage language identification (LID) system based on a shallow ResNet14 followed by a simple 2-layer recurrent neural network (RNN) architecture, which was used for the Xunfei (iFlyTek) Chinese Dialect Recognition Challenge and won first place among 110 teams. The system first trains an acoustic model (AM) with connectionist temporal classification (CTC) to recognize the given phonetic sequence annotation, and then trains another RNN to classify the dialect category by utilizing th…
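The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal numpy illustration of the data flow only: the weights are random placeholders standing in for the CTC-trained ResNet14 acoustic model, and mean pooling plus a softmax stands in for the paper's 2-layer RNN classifier; the 128-dim bottleneck size and the 10 dialect classes are assumptions, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (assumed pre-trained): a frame-level acoustic model mapping 64-dim
# filterbank frames to hidden "bottleneck" features. Random weights here are
# placeholders for a ResNet14 trained with the CTC loss on phone sequences.
W_am = rng.standard_normal((64, 128)) * 0.1

def acoustic_features(frames):
    """(T, 64) frames -> (T, 128) frozen stage-1 features."""
    return np.tanh(frames @ W_am)

# Stage 2: a dialect classifier on top of the frozen stage-1 features.
# The paper uses a 2-layer RNN; mean pooling + softmax is the simplest
# stand-in that shows how stage 2 consumes stage-1 outputs.
W_cls = rng.standard_normal((128, 10)) * 0.1  # 10 dialect classes (assumed)

def classify_dialect(frames):
    h = acoustic_features(frames).mean(axis=0)  # utterance-level embedding
    logits = h @ W_cls
    e = np.exp(logits - logits.max())
    return e / e.sum()  # posterior over dialect classes

probs = classify_dialect(rng.standard_normal((200, 64)))
```

The key design point the sketch preserves is that only stage 2 is trained for the LID objective; stage 1 is reused as a fixed feature extractor.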

Cited by 16 publications (7 citation statements)
References 29 publications
“…Using a Conformer sequence-to-sequence feature extractor, we have successfully trained ASR-based multilingual bottleneck features without explicitly performing forced phone alignment. A similar behavior was observed for a dialect recognition task [26]. We compare these features with a very strong baseline: classical multilingual bottleneck features [11].…”
Section: Discussion (supporting)
confidence: 57%
“…With this approach, state-of-the-art language recognition performance has been achieved without defining a frame alignment of phone labels, using only one language for the ASR task [7]. Moreover, for Chinese dialect recognition, the use of the phone forced alignment performed by the acoustic model trained with the CTC loss does not improve language recognition performance over the direct use of the sequence-to-sequence bottleneck features [26]. Multi-task training of a joint ASR and LID model has been successfully performed for English and Hindi corpora [27].…”
Section: Introduction (mentioning)
confidence: 99%
“…Adding a large number of pronunciations for each single word also increases the computational cost, because adding alternatives enlarges the search space. For these reasons, the research community has recently turned its attention towards DNN-based techniques and has proposed various DNN-based approaches to perform accent and dialect identification more effectively [50][51][52][53][54][55].…”
Section: Related Work (mentioning)
confidence: 99%
“…In this work, we did not invest resources in developing an acceptable automatic speech recognition system for each target language with the goal of performing forced phone alignment. We simply trained a single multilingual end-to-end speech recognition system with the connectionist temporal classification (CTC) loss [21], extending the idea of [22,23]. We used the Conformer architecture [16] with an output layer specific to each target language, with 64 mel-filterbank features as input.…”
Section: Bottleneck Features (mentioning)
confidence: 99%
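The CTC loss referenced in these citation statements scores a label sequence by summing over all frame-level alignments that collapse to it, which is why no forced phone alignment is needed. A minimal sketch of the forward (alpha) recursion in log space, assuming per-frame log posteriors are already given (this is an illustrative implementation, not code from any cited system):

```python
import numpy as np

def ctc_forward_prob(log_probs, labels, blank=0):
    """Log-probability of `labels` under CTC via the forward recursion.

    log_probs: (T, V) per-frame log posteriors over V symbols (incl. blank).
    labels:    target symbol indices, e.g. a phone sequence.
    """
    T, V = log_probs.shape
    # Extended sequence with blanks interleaved: b, l1, b, l2, ..., b
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]      # start in leading blank...
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]  # ...or in the first label
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]            # stay on the same state
            if s > 0:
                cands.append(alpha[t - 1, s - 1])  # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])  # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # Valid paths end in the final blank or the final label.
    if S > 1:
        return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
    return alpha[T - 1, S - 1]
```

As a sanity check: with two frames, a vocabulary of {blank, "a"}, uniform per-frame posteriors (0.5 each), and target "a", three of the four alignments (aa, a-, -a) collapse to "a", so the total probability is 0.75.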