2017
DOI: 10.48550/arxiv.1708.04469
Preprint

Comparison of Decoding Strategies for CTC Acoustic Models


Cited by 8 publications (7 citation statements)
References 0 publications
“…We used an RNN-based LM, trained on graphemes as described in [22]. It featured 1 hidden layer with 1024 LSTM cells.…”
Section: Grapheme Based RNN LM (mentioning)
confidence: 99%
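
The cited setup is specific enough to illustrate: a grapheme-level language model with a single hidden layer of 1024 LSTM cells. Below is a minimal PyTorch sketch of such a model; the class name GraphemeLM, the vocabulary size, and the embedding dimension are illustrative assumptions, not details taken from the cited work.

```python
import torch
import torch.nn as nn

class GraphemeLM(nn.Module):
    """Single-layer LSTM language model over graphemes.

    hidden_dim=1024 matches the cited setup; vocab_size and
    embed_dim are assumptions chosen for illustration.
    """
    def __init__(self, vocab_size=50, embed_dim=128, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, grapheme_ids, state=None):
        # grapheme_ids: (batch, seq_len) integer ids; returns logits over
        # the next grapheme at every position, plus the LSTM state.
        x = self.embed(grapheme_ids)
        out, state = self.lstm(x, state)
        return self.proj(out), state

# Example: next-grapheme logits for a batch of two 10-grapheme sequences.
logits, _ = GraphemeLM()(torch.randint(0, 50, (2, 10)))
```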
“…Except for the modeling unit, these models are very similar to conventional acoustic models and perform well when combined with an external LM during decoding (beam search) [23,24].…”
Section: Introduction (mentioning)
confidence: 99%
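
Combining CTC posteriors with an external LM in a beam search, as this statement describes, is the standard shallow-fusion recipe: a CTC prefix beam search in which every extension of a hypothesis by a new label is additionally scored by the LM. The sketch below is a minimal pure-Python version under stated assumptions: lm_score is a hypothetical callable mapping (prefix ids, next label id) to a log-probability, and the weight alpha is illustrative.

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_beam_search(log_probs, labels, blank=0, beam_size=8,
                    lm_score=None, alpha=0.5):
    """CTC prefix beam search with optional shallow LM fusion.

    log_probs: T x V per-frame log posteriors from the acoustic model.
    labels:    list mapping label ids to output symbols (blank at index 0).
    lm_score:  hypothetical callable (prefix_ids, next_id) -> log P_lm.
    """
    # Each prefix keeps two scores: probability mass ending in blank (p_b)
    # and ending in a non-blank label (p_nb), both in log space.
    beam = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for v, p in enumerate(frame):
                if v == blank:
                    # Blank extends the prefix without emitting a label.
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                lm = alpha * lm_score(prefix, v) if lm_score else 0.0
                new_prefix = prefix + (v,)
                nb_b, nb_nb = next_beam[new_prefix]
                if prefix and prefix[-1] == v:
                    # A repeated label only emits again after an intervening
                    # blank; otherwise the frame extends the same prefix.
                    next_beam[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p + lm))
                    ob_b, ob_nb = next_beam[prefix]
                    next_beam[prefix] = (ob_b, logsumexp(ob_nb, p_nb + p))
                else:
                    next_beam[new_prefix] = (
                        nb_b, logsumexp(nb_nb, p_b + p + lm, p_nb + p + lm))
        # Prune to the most probable prefixes.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: logsumexp(*kv[1]),
                           reverse=True)[:beam_size])
    best = max(beam.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(labels[v] for v in best)
```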
“…To evaluate our setup, we used the same decoding procedure as in [3], greedily searching for the best path without an external language model, and evaluated our systems by computing the token error rate (TER) as the primary measure. In addition, we trained a character-based neural network language model for English on the training utterances, as described in [50], so that for the recognition of English we could also measure a word error rate (WER) by decoding the network outputs with this language model. As the language model is trained on only a small amount of data, the word error rate obtained with it should indicate whether the improvements in TER of the pure CTC model measured on English also lead to a better word-level speech recognition system.…”
Section: Discussion (mentioning)
confidence: 99%
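
The greedy (best-path) decoding and TER measure used in this statement are simple to state in code: take the argmax label per frame, collapse repeats, drop blanks, and normalize the Levenshtein distance by the reference length. A minimal sketch follows; the function names are illustrative.

```python
def greedy_ctc_decode(log_probs, blank=0):
    """Best-path CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    best_path = [frame.index(max(frame)) for frame in log_probs]
    decoded, prev = [], blank
    for label in best_path:
        if label != blank and label != prev:
            decoded.append(label)
        prev = label
    return decoded

def token_error_rate(reference, hypothesis):
    """Levenshtein edit distance between two token sequences,
    normalized by the reference length."""
    prev_row = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, 1):
        row = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, 1):
            row[j] = min(prev_row[j] + 1,                         # deletion
                         row[j - 1] + 1,                          # insertion
                         prev_row[j - 1] + (ref_tok != hyp_tok))  # substitution
        prev_row = row
    return prev_row[-1] / max(len(reference), 1)
```

The WER mentioned in the quoted passage is the same edit-distance computation applied to word sequences rather than grapheme or token sequences.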