When CTC Training Meets Acoustic Landmarks

He, Di; Yang, Xuesong; Lim, Boon Pang; Liang, Yi; Hasegawa‐Johnson, Mark; Chen, Deming

doi:10.1109/icassp.2019.8683607

Cited by 5 publications

(2 citation statements)

References 22 publications

“…In a CTC system with character outputs, however, it is difficult to share data for multilingual [16] or cross-lingual [17] ASR. Proposed solutions have included separate softmax tiers for the character set of each language [18,19,20], or the generation of phone strings instead of characters as the output of the CTC [21,22,23], or the use of both methods, in a multi-task learning framework, with one output tier generating phones, while another generates characters [24].…”

Section: Introduction

mentioning

confidence: 99%

Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

Li, Hasegawa-Johnson. 2020. Preprint.

Self Cite

Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; Tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve lower error rate in the joint phone+tone tier, but asynchronous training results in lower tone error rate.

“…The MTL approach is applied to neural networks by sharing some of the hidden layers between different tasks. Some research could improve the accuracy of CTC-based ASR by incorporating acoustic landmarks, which could help CTC training converge more rapidly and smoothly [66,67]. Moreover, the information of acoustic landmarks could be obtained, which could be used as an additional information source, to further improve the performance of the APED system [68].…”

mentioning

confidence: 99%

End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture

Zhang, Zhao, et al. 2020. Sensors.

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides us with a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. After this, the improved ASR system was used in the APED task of Mandarin, and good results were obtained. This new APED method makes force alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that in regards to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and has a stronger effect on the F-measure metrics, which are especially suitable for the requirements of the APED task.

Does A Priori Phonological Knowledge Improve Cross-Lingual Robustness of Phonemic Contrasts?

Skidmore, Gutkin. 2020. Speech and Computer.
