ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746821
Investigating Sequence-Level Normalisation For CTC-Like End-to-End ASR

Abstract: End-to-end Automatic Speech Recognition (E2E ASR) significantly simplifies the training process of an ASR model. Connectionist Temporal Classification (CTC) is one of the most popular methods for E2E ASR training. Implicitly, CTC has a unique topology which is very useful for sequence modelling. However, we find that by changing to another topology, we can make it even more effective. In this paper, we propose a new CTC-like method, for E2E ASR training, by modifying the topology of original CTC, so that the w…

Cited by 4 publications (4 citation statements). References 19 publications.
“…(3) Combined with the CTC decoding algorithm, the end-to-end prediction of sequence data is effectively realized [10].…”
Section: Core Idea of CTC
confidence: 99%
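The "CTC decoding algorithm" this quote refers to is, in its simplest form, best-path (greedy) decoding: take the most likely label per frame, collapse consecutive repeats, then remove blanks. A minimal sketch, assuming per-frame log-probabilities with the blank at index 0 (the function name and example values are illustrative, not taken from the cited works):

import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    # Best-path CTC decoding: argmax label per frame, collapse
    # consecutive repeats, then drop blank symbols.
    best_path = np.argmax(log_probs, axis=-1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded

# Example: 5 frames, 3 symbols (0 = blank); the frame-wise best path
# [1, 1, 0, 2, 2] collapses to [1, 2].
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(np.log(probs)))  # -> [1, 2]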
“…The recursive formula for the backward probability is shown in Eq. (10).…”
Section: Core Idea of CTC
confidence: 99%
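For context, the backward probability mentioned here follows, in the standard CTC formulation of Graves et al. (2006), the recursion below; the citing paper's Eq. (10) is not reproduced on this page, so this is the textbook form rather than a quotation. Here y^t_k is the network output for label k at time t, l' is the label sequence with blanks inserted, and b is the blank:

\[
\beta_t(s) =
\begin{cases}
\bigl(\beta_{t+1}(s) + \beta_{t+1}(s+1)\bigr)\, y^{t}_{l'_s}, & \text{if } l'_s = b \ \text{or}\ l'_{s+2} = l'_s,\\[4pt]
\bigl(\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)\bigr)\, y^{t}_{l'_s}, & \text{otherwise.}
\end{cases}
\]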
“…Note that with most topologies (except CTC topology), not all the paths in E are valid, which makes the summation of the probabilities of all possible word sequences not equal to one. A previous work [22] has shown that the normalisation (the denominator term in eq. (2)) is crucial for the sequence-level loss function.…”
Section: Training
confidence: 99%
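The normalisation at issue is the denominator of a sequence-level, MMI-style objective. Schematically, in our notation (eq. (2) of the paper is not reproduced on this page), for acoustic features X and word sequence w with prior P(w):

\[
\mathcal{L}_{\text{seq}} = -\log \frac{p(X \mid w)\, P(w)}{\sum_{w'} p(X \mid w')\, P(w')}.
\]

With the CTC topology every path is valid and the denominator probabilities sum to one, so the term can be dropped; with the other topologies it must be computed explicitly, which is why the quote calls it crucial.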
“…Note that in [26], the authors compared different CTC topologies, while in this work, we compare different topologies, most of which are not equivalent to CTC. Inspired by previous work [22], we introduce an extra state for each phone to increase the modelling power, resulting in the S2-T1 topology in table 1, where there is no self-loop for the first state in S2-T1, and the second state is optional and skippable. S2-T1⋆ is similar to S2-T1, except for the self-loop on the first state.…”
Section: Topologies
confidence: 99%
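From the prose alone (table 1 itself is not shown on this page), the S2-T1 topology can be sketched as a per-phone transition list; the state names, the exit convention, and the self-loop on the second state are our reading of the description, not a reproduction of the paper's table:

# Per-phone arcs as (source_state, destination_state); state 2 stands for
# the exit into the next phone. Derived from the prose only; table 1 in
# the paper is authoritative.
S2_T1 = [
    (0, 1),  # first state -> second state (no self-loop on state 0)
    (0, 2),  # skip the optional second state entirely
    (1, 1),  # assumed self-loop on the second state
    (1, 2),  # second state -> exit
]

# S2-T1* differs only in allowing a self-loop on the first state.
S2_T1_STAR = S2_T1 + [(0, 0)]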