Interspeech 2019
DOI: 10.21437/interspeech.2019-2860
Vectorized Beam Search for CTC-Attention-Based Speech Recognition

Abstract: Attention-based encoder-decoder networks use a left-to-right beam search algorithm in the inference step. The current beam search expands hypotheses and traverses the expanded hypotheses at the next time step. This traversal is generally implemented with a for-loop program, which slows down the recognition process. In this paper, we propose a parallelism technique for beam search, which accelerates the search process by vectorizing multiple hypotheses to eliminate the for-loop program. We also p…
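The core idea in the abstract — scoring all beam hypotheses in one batched operation instead of looping over them — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, shapes, and use of NumPy are assumptions for the sake of the example.

```python
import numpy as np

def vectorized_beam_step(scores, log_probs, beam_size):
    """One decoding step over all hypotheses at once.

    scores:    (beam,) cumulative log-probability of each hypothesis
    log_probs: (beam, vocab) next-token log-probabilities per hypothesis
    Returns (new_scores, beam_ids, token_ids) for the top `beam_size`
    expanded hypotheses, selected with a single flattened top-k
    instead of a per-hypothesis for-loop.
    """
    total = scores[:, None] + log_probs   # (beam, vocab) via broadcasting
    flat = total.ravel()                  # (beam * vocab,)
    vocab = log_probs.shape[1]
    # Unordered top-k, then sort only those k candidates.
    topk = np.argpartition(-flat, beam_size)[:beam_size]
    topk = topk[np.argsort(-flat[topk])]
    return flat[topk], topk // vocab, topk % vocab

# Example: two hypotheses, vocabulary of three tokens.
scores = np.log(np.array([0.6, 0.4]))
log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.1, 0.8]]))
new_scores, beam_ids, token_ids = vectorized_beam_step(scores, log_probs, 2)
# Best expansions: hypothesis 0 + token 0 (0.42), hypothesis 1 + token 2 (0.32)
```

The single `argpartition` over the flattened `(beam * vocab)` score matrix is what replaces the per-hypothesis traversal loop; on a GPU the analogous batched top-k gives the speed-up the paper targets.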

Cited by 33 publications (46 citation statements)
References 24 publications (22 reference statements)
“…Large-scale training/decoding We support job schedulers (e.g., SLURM, Grid Engine), multiple GPUs and half/mixed-precision training/decoding with apex (Micikevicius et al, 2018). 5 Our beam search implementation vectorizes hypotheses for faster decoding (Seki et al, 2019).…”
Section: Additional Features (mentioning)
confidence: 99%
“…We use joint training with hybrid CTC/attention ASR by setting mtl-alpha to 0.3 and asr-weight to 0.5 as defined by Watanabe et al (2018). During inference, we perform beam search (Seki et al, 2019) on the ST sequences, using a beam size of 10, length penalty of 0.2, max length ratio of 0.3 (Watanabe et al, 2018).…”
Section: Multi-decoder Model (mentioning)
confidence: 99%
“…8 best checkpoints are averaged and the averaged weights are used for decoding the hypothesis. Vectorized beam search (Seki et al, 2019) was used for decoding the ASR hypotheses with a beam size of 10. Further in this paper, ASR models described in this section are referred to as Ext.ASR models (Externally trained ASR models).…”
Section: Automatic Speech Recognition (ASR) (mentioning)
confidence: 99%
“…The noisy EOS tokens are pruned out using (Kahn et al, 2019). Vectorized beam search (Seki et al, 2019) has been used for decoding the hypotheses with a beam size of 8. A large variance in the performance is observed w.r.t. the decoding hyper-parameters such as maximum target sequence length and length-bonus.…”
Section: Machine Translation Systems (MT) (mentioning)
confidence: 99%