Interspeech 2017
DOI: 10.21437/interspeech.2017-1683

Comparison of Decoding Strategies for CTC Acoustic Models

Abstract: Connectionist Temporal Classification (CTC) has recently attracted a lot of interest as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporated conveniently during decoding, retaining the traditional separation of acoustic and linguistic components in ASR. For …
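
The conditional-independence property described in the abstract is what makes the simplest decoding strategy, frame-wise best-path ("greedy") decoding, so cheap: only the collapse rule of the CTC output alphabet has to be applied. As a minimal sketch (ours, not code from the paper; the function name and arguments are illustrative):

```python
def ctc_greedy_decode(log_probs, blank_id=0):
    """Best-path (greedy) CTC decoding: pick the most likely symbol at every
    frame, then collapse consecutive repeats and drop blanks.

    log_probs: T x V matrix (list of lists) of per-frame log posteriors,
               where index `blank_id` is the CTC blank symbol.
    Returns the decoded label sequence as a list of symbol indices.
    """
    decoded, prev = [], blank_id
    for frame in log_probs:
        symbol = max(range(len(frame)), key=frame.__getitem__)  # frame-wise argmax
        if symbol != prev and symbol != blank_id:
            decoded.append(symbol)
        prev = symbol
    return decoded
```

For example, a frame-wise argmax path h, h, -, e, l, l, -, l, o (with - the blank) collapses to "hello": repeats are merged unless separated by a blank, and blanks are removed.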

Cited by 36 publications (34 citation statements)
References 28 publications (50 reference statements)
“…Except for the modeling unit, these models are very similar to conventional acoustic models and perform well when combined with an external LM during decoding (beam search) [23, 24].…”
Section: Introduction (mentioning)
confidence: 99%
“…Some works use this decoding method to build the CTC-layers in their hardware architectures of RNNs [17]. Although this way can already provide useful transcriptions, its limited accuracy is not sufficient to meet the demands of many sequence tasks [26].…”
Section: B. CTC Beam Search Decoding (mentioning)
confidence: 99%
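
The excerpt above contrasts plain greedy decoding with beam search over CTC outputs. For orientation only, a compact version of the standard CTC prefix beam search could look as follows; this is our sketch, not an implementation from the cited works, and all names are illustrative:

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, beam_size=8, blank_id=0):
    """CTC prefix beam search over a T x V matrix of per-frame log posteriors.
    Each prefix keeps two scores: the probability of ending in blank (p_b)
    and of ending in its last non-blank symbol (p_nb)."""
    beams = {(): (0.0, NEG_INF)}  # empty prefix: p_b = 1, p_nb = 0
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == blank_id:
                    # blank keeps the prefix unchanged
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                elif prefix and s == prefix[-1]:
                    # repeated symbol: extending requires an intervening blank
                    ext = prefix + (s,)
                    e_b, e_nb = next_beams[ext]
                    next_beams[ext] = (e_b, logsumexp(e_nb, p_b + p))
                    # otherwise the repeat is merged into the same prefix
                    k_b, k_nb = next_beams[prefix]
                    next_beams[prefix] = (k_b, logsumexp(k_nb, p_nb + p))
                else:
                    ext = prefix + (s,)
                    e_b, e_nb = next_beams[ext]
                    next_beams[ext] = (e_b, logsumexp(e_nb, p_b + p, p_nb + p))
        # keep only the `beam_size` most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_size])
    best_prefix, scores = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return list(best_prefix), logsumexp(*scores)
```

A language-model term is typically folded into the prefix score at the point where a prefix is extended by a non-blank symbol.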
“…In ASR tasks, the traditional approach is based on HMMs [16], while recent works have shown great interest in building end-to-end models, using CTC-based deep RNNs. By training networks with large amounts of data, CTC-based models achieved great success [7], [11], [4], [12], [26], [18]. CTC is also widely used in other learning tasks such as handwriting recognition and scene text recognition, offering superior performance [8], [2], [19].…”
Section: Introduction (mentioning)
confidence: 99%
“…The first lexicon-free beam-search decoder aiming at dealing with OOV was benchmarked on Switchboard [21], although with a significantly worse word error rate (WER) than lexicon-based systems. Other recent works in this direction include [22, 23] on the English and [24, 25] on the Arabic and Finnish languages. Here, we study a simple end-to-end ASR system combining a character-level acoustic model with a character-level language model through beam search. We show that it can yield competitive word error rates on the WSJ and Librispeech corpora, even without a lexicon.…”
(mentioning)
confidence: 99%
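
The last excerpt describes combining a character-level acoustic model with a character-level LM during beam search. The combined hypothesis score is usually written as an acoustic term plus a weighted LM term and a length bonus; the sketch below is illustrative, and the weights are placeholders rather than values reported in the cited work:

```python
def fused_score(log_p_ctc, log_p_lm, num_chars, alpha=0.8, beta=1.0):
    """Beam-search score for one hypothesis: CTC (acoustic) log probability,
    a weighted character-LM log probability, and a length bonus that
    counteracts the LM's preference for short strings.
    alpha and beta are placeholders, tuned on held-out data in practice."""
    return log_p_ctc + alpha * log_p_lm + beta * num_chars
```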