2018
DOI: 10.1121/1.5039837
Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model

Abstract: Most mainstream automatic speech recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on a contradictory idea that some frames are more important than others. Acoustic landmark theory exploits quantal nonlinearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speec…


Cited by 11 publications (9 citation statements)
References 33 publications (48 reference statements)
“…Landmark-based ASR has been shown to slightly reduce the WER of a large-vocabulary speech recognizer, but only in a rescoring paradigm using a very small test set [18]. Landmarks can reduce computational load for DNN/HMM hybrid models [12,13] and can improve recognition accuracy [11]. Previous works [11,12,13,19] annotated landmark positions mostly following experimental findings presented in [20,21].…”
Section: Acoustic Landmarks
confidence: 99%
“…Many efforts have been made to augment acoustic modeling with acoustic landmarks [11,12,13], which are located using accurate time-aligned phonetic transcriptions. To the best of our knowledge, only TIMIT [14] (5.4 hours) provides such fine-grained transcriptions.…”
Section: Introduction
confidence: 99%
“…We extracted landmark training labels by referencing the TIMIT human-annotated phone boundaries. An example of the labeling, taken from [7], is presented in Fig. 2 and illustrates the labeling of the word "Symposium". The figure is generated using Praat [19].…”
Section: Defining and Marking Landmarks
confidence: 99%
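The boundary-referencing procedure quoted above can be sketched in a few lines. This is an illustrative assumption, not the authors' annotation code: the phone classes below are abbreviated, and the rule (place a landmark at every boundary where the manner class changes) is a simplification of the landmark definitions used in the cited work.

```python
# Hypothetical sketch: derive landmark times from TIMIT-style
# time-aligned phone segments (start_sec, end_sec, phone).
VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uw", "uh",
          "ow", "ey", "ay", "oy", "aw", "er"}
STOPS = {"p", "t", "k", "b", "d", "g"}

def manner(phone):
    """Map a phone label to a coarse manner class (assumed taxonomy)."""
    if phone in VOWELS:
        return "vowel"
    if phone in STOPS:
        return "stop"
    return "other"

def landmark_times(segments):
    """Return the boundary times at which the manner class changes.

    segments: list of (start_sec, end_sec, phone), in time order.
    """
    times = []
    for (_, end1, p1), (_, _, p2) in zip(segments, segments[1:]):
        if manner(p1) != manner(p2):
            times.append(end1)  # boundary between the two segments
    return times
```

For example, `landmark_times([(0.0, 0.1, "t"), (0.1, 0.3, "iy")])` places one landmark at the stop-to-vowel boundary at 0.1 s.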
“…Automatic speech recognition (ASR) systems have been proposed that depend completely on landmarks, with no regard for the steady-state regions of the speech signal [5], and such systems have been demonstrated to be competitive with phone-based ASR under certain circumstances. Other studies have proposed training two separate sets of classifiers, one trained to recognize landmarks, another trained to recognize steady-state phone segments, and fusing the two for improved accuracy [6] or for reduced computational complexity [7]. It has been difficult to build cross-lingual ASR from such systems, however, because very few of the world's languages possess large corpora with the correct timing of consonant release and consonant closure landmarks manually coded.…”
Section: Introduction
confidence: 99%
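One common way to fuse two classifiers' frame posteriors, as in the landmark/segment fusion mentioned above, is log-linear interpolation. This is a minimal sketch under that assumption; the cited papers [6,7] may use a different combination rule, and the weight `w` is hypothetical.

```python
import numpy as np

def fuse_posteriors(p_a, p_b, w=0.5):
    """Log-linear interpolation of two posterior distributions per frame.

    p_a, p_b: arrays of shape (frames, classes); rows sum to 1.
    w: hypothetical interpolation weight between the two classifiers.
    """
    # Combine in the log domain, with a small floor to avoid log(0).
    log_p = w * np.log(p_a + 1e-12) + (1.0 - w) * np.log(p_b + 1e-12)
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)  # renormalize each frame
```

With `p_a = [[0.7, 0.3]]` and `p_b = [[0.6, 0.4]]`, the fused distribution still favors the first class and sums to 1 per frame.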
“…The MTL approach is applied to neural networks by sharing some of the hidden layers between different tasks. Some research has improved the accuracy of CTC-based ASR by incorporating acoustic landmarks, which help CTC training converge more rapidly and smoothly [66,67]. Moreover, acoustic landmark information can serve as an additional information source to further improve the performance of the APED system [68].…”
confidence: 99%
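The shared-hidden-layer MTL idea described above can be sketched as a single forward pass: one shared representation feeds two task-specific output heads. Everything here is an illustrative assumption (layer sizes, a 40-class phone head, a binary landmark head), not the cited systems' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))            # batch of 4 feature frames

# Shared hidden layer: both tasks read the same representation.
W_shared = rng.standard_normal((16, 32)) * 0.1
h = np.maximum(x @ W_shared, 0.0)           # ReLU

# Task-specific heads on top of the shared layer.
W_phone = rng.standard_normal((32, 40)) * 0.1     # 40 phone classes (assumed)
W_landmark = rng.standard_normal((32, 2)) * 0.1   # landmark vs. non-landmark

phone_logits = h @ W_phone                  # shape (4, 40)
landmark_logits = h @ W_landmark            # shape (4, 2)
```

During training, gradients from both losses would flow back into `W_shared`, which is what lets the landmark task regularize and speed up the main recognition task.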