Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1336

End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

Abstract: Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in parallel. We present a novel non-autoregressive architecture based on connectionist temporal classification and evaluate it on the task of neural machine translation. Unlike other non-autoregressive methods which operate in several steps, our model can be trained end-to-end. We condu…
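The training setup described in the abstract can be summarized with a short, hedged sketch: a decoder emits logits for every output position in parallel, and a CTC loss marginalizes over all alignments of the reference translation to those positions. The snippet below uses PyTorch's nn.CTCLoss; the tensor shapes, vocabulary size, and decoder length are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): train a parallel decoder with CTC.
# Assumptions: PyTorch, a blank symbol at index 0, and a decoder length T_DEC
# fixed to a multiple of the source length so it always exceeds the target length.
import torch
import torch.nn as nn
import torch.nn.functional as F

BLANK = 0        # index reserved for the CTC blank symbol
VOCAB = 32000    # target vocabulary size (including the blank) -- assumed value
T_DEC = 60       # number of parallel decoder positions -- assumed value
BATCH = 8

ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# Stand-in for the decoder output: one logit vector per position, produced in parallel.
logits = torch.randn(BATCH, T_DEC, VOCAB, requires_grad=True)

# Padded reference translations (token ids > 0) and their true lengths.
targets = torch.randint(1, VOCAB, (BATCH, 25))
target_lengths = torch.full((BATCH,), 25, dtype=torch.long)
input_lengths = torch.full((BATCH,), T_DEC, dtype=torch.long)

# nn.CTCLoss expects (T, batch, vocab) log-probabilities; it sums over all
# alignments of the reference to the T_DEC positions, so no decoding order is needed.
log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # one backward pass; all positions are predicted independently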

Cited by 116 publications (141 citation statements)
References 12 publications
“…We compare our approach to three other parallel decoding translation methods: the fertility-based sequence-to-sequence model of Gu et al. (2018), the CTC-loss transformer of Libovický and Helcl (2018), and the iterative refinement approach of Lee et al. (2018). The first two methods are purely non-autoregressive, while the iterative refinement approach is only non-autoregressive in the first decoding iteration, similar to our approach.…”
Section: Translation Quality
confidence: 99%
“…Gu et al. (2018) introduce a transformer-based approach with explicit word fertility, and identify the multi-modality problem. Libovický and Helcl (2018) approach the multi-modality problem by collapsing repetitions with the Connectionist Temporal Classification training objective (Graves et al., 2006). Perhaps most similar to our work is the iterative refinement approach of Lee et al. (2018), in which the model corrects the original non-autoregressive prediction by passing it multiple times through a denoising autoencoder.…”
Section: Parallel Decoding for Machine Translation
confidence: 99%
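For context on the "collapsing repetitions" step mentioned in the statement above: in CTC, the per-position predictions are turned into an output sentence by merging consecutive duplicate labels and then dropping blank symbols. The function below is a generic greedy-decoding sketch of that rule (the blank index and example values are assumptions), not code from the cited paper.

from typing import List

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_collapse(frame_labels: List[int], blank: int = BLANK) -> List[int]:
    # Keep a label only when it starts a new run and is not blank,
    # i.e. merge consecutive repeats first, then remove blanks.
    output, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output

# Per-position argmax "_ 5 5 _ 7 7 7 _ 5" collapses to [5, 7, 5]:
print(ctc_collapse([0, 5, 5, 0, 7, 7, 7, 0, 5]))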
“…We first conduct experiments to compare the performance of FlowSeq with strong baseline models, including NAT w/ Fertility (Gu et al., 2018), NAT-IR (Lee et al., 2018), NAT-REG (Wang et al., 2019), LV NAR (Shu et al., 2019), CTC Loss (Libovický and Helcl, 2018), and CMLM (Ghazvininejad et al., 2019). Table 1 provides the BLEU scores of FlowSeq with argmax decoding, together with baselines with purely non-autoregressive decoding methods that generate the output sequence in one parallel pass.…”
Section: Results
confidence: 99%
“…Lee et al. (2018) proposed a method of iterative refinement based on a latent variable model and a denoising autoencoder. Libovický and Helcl (2018) treat NAT as a connectionist temporal classification problem, which achieves better latency. Kaiser et al. (2018) use discrete latent variables, which makes decoding much more parallelizable.…”
Section: Ablation Study
confidence: 99%
“…* This work was done when the first author was on an internship at Tencent. Recently, a line of research (Gu et al., 2017; Lee et al., 2018; Libovický and Helcl, 2018; Wang et al., 2018) has proposed to break the autoregressive bottleneck by introducing non-autoregressive neural machine translation (NAT). In NAT, the decoder generates all words simultaneously instead of sequentially.…”
Section: Introduction
confidence: 99%