Tone Recognition Using Lifters and CTC

Lugosch, Loren; Tomar, Vikrant Singh

doi:10.21437/interspeech.2018-2293

Cited by 8 publications

(8 citation statements)

References 14 publications

“…However, at the same time, there are still many works using the ReLU activation F(x) = max{x, 0} [7, 19, 22–24, 27, 28].…”

Section: Activations

mentioning

confidence: 99%

End-to-End Mandarin Speech Recognition Combining CNN and BLSTM

Wang

2019

Symmetry

Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and draw on varied expertise, they are hard to build and train. Recent research shows that end-to-end ASR systems can significantly simplify the speech recognition pipeline and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they use specific language models and in-house training databases which are not freely available. This is especially common for Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR. This CNN+BLSTM+CTC ASR uses a Convolutional Neural Network (CNN) to learn local speech features, a Bidirectional Long Short-Term Memory (BLSTM) network to learn past and future contextual information, and Connectionist Temporal Classification (CTC) for decoding. Our model is trained entirely on AISHELL-1, by far the largest open-source Mandarin speech corpus, using neither in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a WER of 19.2%, outperforming the best existing work. Because all the data corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.

“…Zhang [21] uses it as test data to evaluate language model. Lugosch [22] uses it to recognize tones in continuous speech for tonal languages.…”

mentioning

confidence: 99%

End-to-End Mandarin Speech Recognition Combining CNN and BLSTM

Wang

2019

Symmetry

“…Frame-based approach feeds a sequence of frames directly to the classifier, which outputs a sequence of labels. To capture context information, RNN [3,13] and CNN [3] are frequently used. Also, frame-based frameworks often use techniques such as pooling [12], attention [3], or Connectionist Temporal Classification (CTC) [13] to perform frame-level alignment to correctly output a series of tone labels.…”

Section: Related Work

mentioning

confidence: 99%

“…To capture context information, RNN [3,13] and CNN [3] are frequently used. Also, frame-based frameworks often use techniques such as pooling [12], attention [3], or Connectionist Temporal Classification (CTC) [13] to perform frame-level alignment to correctly output a series of tone labels. This approach allows a single training pass to cover both the alignment and classification task, and does not require a pretrained ASR model.…”

Section: Related Work

mentioning

confidence: 99%
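The statements above mention CTC as a way to turn frame-level outputs into a series of tone labels. As a rough illustration only, the sketch below shows greedy (best-path) CTC decoding: take the top-scoring label per frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the blank index 0, and the toy scores are assumptions made for this example; neither the cited paper nor the citing papers are known to decode this way.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame label scores into a label sequence (best-path decoding)."""
    # Pick the highest-scoring label index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge runs of identical labels, then remove blanks.
    return [label for label, _ in groupby(best_path) if label != blank]


# Toy scores for six frames over {blank, tone 1, tone 2}; values are made up.
frame_scores = [
    [0.10, 0.80, 0.10],  # tone 1
    [0.20, 0.70, 0.10],  # tone 1 (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank separates two occurrences of tone 1
    [0.10, 0.80, 0.10],  # tone 1
    [0.10, 0.10, 0.80],  # tone 2
    [0.10, 0.10, 0.80],  # tone 2 (repeat, merged)
]

print(ctc_greedy_decode(frame_scores))  # [1, 1, 2]
```

The blank frame is what lets two identical tones in a row survive the merge step; without it, the two runs of tone 1 would collapse into one label.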

End-to-End Mandarin Tone Classification with Short Term Context Information

Tang, Li

2021

Preprint

In this paper, we propose an end-to-end Mandarin tone classification method for continuous speech utterances that uses both the spectrogram and short-term context information as inputs. Both Mel-spectrograms and context segment features are used to train the tone classifier. We first divide the spectrogram frames into syllable segments using forced alignment results produced by an ASR model. Then we extract short-term segment features to capture context information across multiple syllables. Feeding both the Mel-spectrogram and the short-term context segment features into an end-to-end model could significantly improve performance. Experiments are performed on a large-scale open-source Mandarin speech dataset to evaluate the proposed method. Results show that this method improves the classification accuracy from 79.5% to 88.7% on the AISHELL-3 database.
