Interspeech 2018 2018
DOI: 10.21437/interspeech.2018-2293
|View full text |Cite
|
Sign up to set email alerts
|

Tone Recognition Using Lifters and CTC

Abstract: In this paper, we present a new method for recognizing tones in continuous speech for tonal languages. The method works by converting the speech signal to a cepstrogram, extracting a sequence of cepstral features using a convolutional neural network, and predicting the underlying sequence of tones using a connectionist temporal classification (CTC) network. The performance of the proposed method is evaluated on a freely available Mandarin Chinese speech corpus, AISHELL-1, and is shown to outperform the existin… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1

Citation Types

0
8
0

Year Published

2019
2019
2024
2024

Publication Types

Select...
6
1

Relationship

0
7

Authors

Journals

citations
Cited by 8 publications
(8 citation statements)
references
References 14 publications
0
8
0
Order By: Relevance
“…However, at the same time, there are still many works using the ReLU activation F (x) = max{x, 0} [7,19,[22][23][24]27,28].…”
Section: Activationsmentioning
confidence: 99%
See 1 more Smart Citation
“…However, at the same time, there are still many works using the ReLU activation F (x) = max{x, 0} [7,19,[22][23][24]27,28].…”
Section: Activationsmentioning
confidence: 99%
“…Zhang [21] uses it as test data to evaluate language model. Lugosch [22] uses it to recognize tones in continuous speech for tonal languages.…”
mentioning
confidence: 99%
“…Frame-based approach feeds a sequence frames directly to the classifier, which outputs a sequence of labels. To capture context information, RNN [3,13] and CNN [3] are frequently used. Also, frame-based frameworks often use techniques such as pooling [12], attention [3], or Connectionist Temporal Classification (CTC) [13] to perform frame-level alignment to correctly output a series of tone labels.…”
Section: Related Workmentioning
confidence: 99%
“…To capture context information, RNN [3,13] and CNN [3] are frequently used. Also, frame-based frameworks often use techniques such as pooling [12], attention [3], or Connectionist Temporal Classification (CTC) [13] to perform frame-level alignment to correctly output a series of tone labels. This approach allows a single training pass to cover both the alignment and classification task, and does not require a pretrained ASR model.…”
Section: Related Workmentioning
confidence: 99%
See 1 more Smart Citation