RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition

Zeyer, Albert; Alkhouli, Tamer; Ney, Hermann

doi:10.18653/v1/p18-4022

Cited by 62 publications

(50 citation statements)

References 31 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…An MP layer has no trainable parameters, but helps primarily in reducing the length of an input sequence by a predetermined pooling factor. It has been observed that reducing the input sequence length progressively through the encoder helps in better convergence and accuracy of the AED models [2,8,16]. In this work, we use an overall time reduction factor of 2 for character encoder, and an additional reduction factor of 4 in the BPE stack, taking the overall reduction factor for the BPE encoder to 8.…”

Section: Attention Based Encoder-decoder Modelsmentioning

confidence: 99%

“…In [9], a teacher-student transfer learning has been used to improve the convergence as well as the performance of character based CTC encoder models. In the case of AED models, it has been shown that a carefully designed layerwise pre-training strategy helps the models to converge better [2]. Similarly, a joint multi-task training using a CTC loss on the encoder and a CE loss on the attention decoder was proposed to improve the convergence and accuracies of AED models [19,20].…”

Section: Multi-stage Multi-task Training Of Online Attention Modelsmentioning

confidence: 99%

“…Our decoder has a single ULSTM layer. We also use attention feedback and a maxout layer similar to to [2]. This whole network is now trained with the cross-entropy loss between decoder output and the ground truth BPE target labels.…”

Section: Stage-3: Training the Attention-decodermentioning

confidence: 99%

“…Our decoder consists of a single ULSTM layer with 1024 units. We use RETURNN framework for all our experiments [26,2,27]. In all our experiments we use Adam optimizer [28] with an initial learning rate of 0.0008 and learning rate scheduling using a cross-validation set.…”

Section: Datasets and Trainingmentioning

confidence: 99%

“…Recently, attention-based encoder-decoder (AED) models have gained popularity for developing end-to-end neural network based automatic speech recognition (ASR) systems [1,2,3]. One of the primary advantages of AED models is that the language information is tightly coupled into the decoder, obviating the need for an external language model (LM).…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models

Garg

Gowda

Kumar

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

View full text Add to dashboard Cite

In this paper, we propose a refined multi-stage multi-task training strategy to improve the performance of online attentionbased encoder-decoder (AED) models. A three-stage training based on three levels of architectural granularity namely, character encoder, byte pair encoding (BPE) based encoder, and attention decoder, is proposed. Also, multi-task learning based on two-levels of linguistic granularity namely, character and BPE, is used. We explore different pre-training strategies for the encoders including transfer learning from a bidirectional encoder. Our encoder-decoder models with online attention show ∼35% and ∼10% relative improvement over their baselines for smaller and bigger models, respectively. Our models achieve a word error rate (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models respectively after fusion with long short-term memory (LSTM) based external language model (LM).

show abstract

Section: Attention Based Encoder-decoder Modelsmentioning

confidence: 99%

Section: Multi-stage Multi-task Training Of Online Attention Modelsmentioning

confidence: 99%

Section: Stage-3: Training the Attention-decodermentioning

confidence: 99%

Section: Datasets and Trainingmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models

Garg

Gowda

Kumar

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

View full text Add to dashboard Cite

show abstract

A Comparative Study on End-to-End Speech to Text Translation

Bahar

Bieschke

Ney

2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Self Cite

View full text Add to dashboard Cite

Recent advances in deep learning show that end-to-end speech to text translation model is a promising approach to direct the speech translation field. In this work, we provide an overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence. We also investigate on pre-training variants such as initializing different components of a model using pretrained models, and their impact on the final performance, which gives boosts up to 4% in BLEU and 5% in TER. Our experiments are performed on 270h IWSLT TED-talks En→De, and 100h LibriSpeech Audiobooks En→Fr. We also show improvements over the current end-to-end state-of-the-art systems on both tasks.

show abstract

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Wang

Khudanpur

Chen

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

View full text Add to dashboard Cite

We present ESPRESSO, an open-source, modular, extensible endto-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit FAIRSEQ. ESPRESSO supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. ESPRESSO achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11× faster for decoding than similar systems (e.g. ESPNET).

show abstract

RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition

Cited by 62 publications

References 31 publications

Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models

Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models

A Comparative Study on End-to-End Speech to Text Translation

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Contact Info

Product

Resources

About