Learning to Parse and Translate Improves Neural Machine Translation

Eriguchi, Akiko; Tsuruoka, Yoshimasa; Cho, Kyunghyun

doi:10.18653/v1/p17-2012

Cited by 128 publications

(116 citation statements)

References 28 publications

(24 reference statements)

Supporting

Mentioning

115

Contrasting

Unclassified

“…In contrast to these approaches, the DSA-LSTM only models the probability of surface strings, albeit with an auxiliary loss that distills the next-word predictive distribution of a syntactic language model. Earlier work has also explored multi-task learning with syntactic objectives as an auxiliary loss in language modelling and machine translation (Luong et al, 2016; Eriguchi et al, 2016; Nadejde et al, 2017; Enguehard et al, 2017; Aharoni and Goldberg, 2017; Eriguchi et al, 2017). Our approach of injecting syntactic bias through a KD objective is orthogonal to this approach, with the primary difference that here the student DSA-LSTM has no direct access to syntactic annotations; it does, however, have access to the teacher RNNG's softmax distribution over the next word.…”

Section: Related Work (mentioning)

confidence: 99%

Scalable Syntax-Aware Language Models Using Knowledge Distillation

Kuncoro

Dyer

Rimell

et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders scaling difficult, and it remains an open question whether structural biases are still necessary when sequential models have access to ever larger amounts of training data. To answer this question, we introduce an efficient knowledge distillation (KD) technique that transfers knowledge from a syntactic language model trained on a small corpus to an LSTM language model, hence enabling the LSTM to develop a more structurally sensitive representation of the larger training data it learns from. On targeted syntactic evaluations, we find that, while sequential LSTMs perform much better than previously reported, our proposed technique substantially improves on this baseline, yielding a new state of the art. Our findings and analysis affirm the importance of structural biases, even in models that learn from large amounts of data.
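The distillation objective described in this abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that the student is trained on an interpolation of the usual next-word cross-entropy and the cross-entropy against the teacher's next-word distribution; the function name `distillation_loss` and the weight `alpha` are hypothetical and not taken from the paper.

```python
import math

def distillation_loss(student_log_probs, teacher_probs, gold_index, alpha=0.5):
    """Interpolate gold-word cross-entropy with teacher cross-entropy.

    student_log_probs: student log-probabilities over the vocabulary.
    teacher_probs: teacher probabilities over the same vocabulary.
    gold_index: index of the observed next word.
    alpha: hypothetical interpolation weight (assumption, not from the source).
    """
    # Standard language-modelling loss on the observed next word.
    nll = -student_log_probs[gold_index]
    # Cross-entropy of the student against the teacher's distribution.
    kd = -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))
    return alpha * kd + (1.0 - alpha) * nll

# Toy three-word vocabulary.
student = [math.log(0.5), math.log(0.25), math.log(0.25)]
teacher = [1.0, 0.0, 0.0]
loss = distillation_loss(student, teacher, gold_index=0)
```

With `alpha=0.0` the sketch reduces to ordinary language-model training, which makes the teacher term easy to ablate.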

“…We then experiment in a low-resource scenario using the German, Russian and Czech to English training data from the News Commentary v8 corpus, following Eriguchi et al (2017). In all cases we parse the English sentences into constituency trees using the BLLIP parser (Charniak and Johnson, 2005).…”

Section: Experiments and Results (mentioning)

confidence: 99%

“…In parallel and highly related to our work, Eriguchi et al (2017) proposed to model the target syntax in NMT in the form of dependency trees by using an RNNG-based decoder (Dyer et al, 2016), while Nadejde et al (2017) incorporated target syntax by predicting CCG tags serialized into the target translation. Our work differs from those by modeling syntax using constituency trees, as was previously common in the "traditional" syntax-based machine translation literature.…”

Section: Introduction and Model (mentioning)

confidence: 99%

Towards String-To-Tree Neural Machine Translation

Aharoni

Goldberg

2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 2: Short Papers)

We present a simple method to incorporate syntactic information about the target language in a neural machine translation system by translating into linearized, lexicalized constituency trees. Experiments on the WMT16 German-English news translation task show improved BLEU scores when compared to a syntax-agnostic NMT baseline trained on the same dataset. An analysis of the translations from the syntax-aware system shows that it performs more reordering during translation in comparison to the baseline. A small-scale human evaluation also showed an advantage to the syntax-aware system.
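To make "linearized, lexicalized constituency trees" concrete, the sketch below flattens a small tree into a single token sequence that a sequence-to-sequence decoder could emit. The nested-tuple tree encoding, the function name `linearize`, and the labelled bracket tokens such as `(NP` and `)NP` are assumptions made for illustration, not a description of the authors' preprocessing.

```python
def linearize(tree):
    """Flatten a nested (label, child, ...) tuple into bracketed tokens.

    Leaves are plain strings (the words); internal nodes are tuples whose
    first element is the constituent label. This encoding is hypothetical.
    """
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    tokens = ["(" + label]          # opening bracket carries the label
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)      # closing bracket repeats the label
    return tokens

tree = ("S", ("NP", "Jane"), ("VP", "had", ("NP", "a", "cat")))
target = " ".join(linearize(tree))
```

Because the words stay in their original order, dropping every bracket token from `target` recovers the plain translation.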

“…Similarly, [35] incorporate linguistic annotation to semantic role labeling task. [9] combined translation and dependency parsing by sharing the translation encoder hidden states with the buffer hidden states in a shift-reduce parsing model [8]. Aiming at the same goal, [1] proposed a very simple method.…”

Section: Related Work (mentioning)

confidence: 99%

Promoting the Knowledge of Source Syntax in Transformer NMT Is Not Needed

Pham

Macháček

Bojar

2019

CyS

The utility of linguistic annotation in neural machine translation seemed to have been established in past papers. The experiments were, however, limited to recurrent sequence-to-sequence architectures and relatively small data settings. We focus on the state-of-the-art Transformer model and use comparably larger corpora. Specifically, we try to promote the knowledge of source-side syntax using multi-task learning either through simple data manipulation techniques or through a dedicated model component. In particular, we train one of the Transformer attention heads to produce the source-side dependency tree. Overall, our results cast some doubt on the utility of multi-task setups with linguistic information. The data manipulation techniques, recommended in previous works, prove ineffective in large data settings. The treatment of self-attention as dependencies seems much more promising: it helps in translation and reveals that the Transformer model can very easily grasp the syntactic structure. An important but curious result is, however, that identical gains are obtained by using trivial "linear trees" instead of true dependencies. The reason for the gain thus may not be coming from the added linguistic knowledge but from some simpler regularizing effect we induced on self-attention matrices.
