Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1263

A Challenge Set Approach to Evaluating Machine Translation

Abstract: Neural machine translation represents an exciting leap forward in translation quality. But what longstanding weaknesses does it resolve, and which remain? We address these questions with a challenge set approach to translation evaluation and error analysis. A challenge set consists of a small set of sentences, each hand-designed to probe a system's capacity to bridge a particular structural divergence between languages. To exemplify this approach, we present an English-French challenge set, and use it to analyze…

Cited by 106 publications (116 citation statements). References 12 publications.
“…This requires attending to two or more regions that can be arbitrarily distant from one another. Several phenomena, such as light verbs (Isabelle and Kuhn, 2018), are known from the linguistic and MT literature to yield lexical LDDs. Our methodology takes a predefined set of such phenomena and defines rules for detecting each of them over dependency parses of the source side.…”
Section: Methods
confidence: 99%
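The methodology described in this excerpt is rule-based: lexical long-distance dependencies such as light-verb constructions are detected by matching patterns over source-side dependency parses. A minimal sketch of one such detector in Python with spaCy follows; the toy light-verb inventory and the single rule are illustrative assumptions, not the cited authors' actual rule set.

import spacy

nlp = spacy.load("en_core_web_sm")

# Toy inventory of English light verbs; real inventories are larger.
LIGHT_VERBS = {"take", "make", "give", "have", "do"}

def detect_light_verbs(sentence):
    """Return (verb, object) pairs where a light verb governs a noun
    direct object, e.g. 'take a walk' or 'make a decision'."""
    doc = nlp(sentence)
    hits = []
    for tok in doc:
        if tok.pos_ == "VERB" and tok.lemma_ in LIGHT_VERBS:
            for child in tok.children:
                # "dobj" is the direct-object relation in spaCy's
                # English dependency scheme.
                if child.dep_ == "dobj" and child.pos_ == "NOUN":
                    hits.append((tok.text, child.text))
    return hits

print(detect_light_verbs("She decided to take a walk before dinner."))
# expected: [('take', 'walk')]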
“…With major improvements in system performance, crude assessments of performance are becoming less satisfying, i.e., evaluation metrics give no indication of how MT systems perform on important challenges for the field (Isabelle and Kuhn, 2018). String-similarity metrics against a reference are known to capture only partial and coarse-grained aspects of the task (Callison-Burch et al., 2006), but are still the common practice in various text generation tasks.…”
Section: MT Evaluation
confidence: 99%
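For concreteness, the reference-based string-similarity scoring this excerpt criticizes usually reduces to n-gram overlap against a reference, as in sentence-level BLEU. A minimal sketch with NLTK; the sentences are invented for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented example sentences; whitespace tokenization for brevity.
reference = "the cat is on the mat".split()
hypothesis = "the cat sat on the mat".split()

# Smoothing avoids zero scores when a higher-order n-gram is absent.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"sentence BLEU: {score:.3f}")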
“…Another line of work has analyzed the robustness of NLP models, both via controlled experiments that complement test-set accuracy and test the abilities of the models (Isabelle et al., 2017; B. Hashemi and Hwa, 2016; White et al., 2017) and via adversarial instances that expose weaknesses (Jia and Liang, 2017).…”
Section: Analysis Of Complex Models
confidence: 99%
“…We draw motivation to study the robustness of NLI models from previous work on evaluating complex models (Isabelle et al., 2017; White et al., 2017). Furthermore, we base our approach on the discipline of behavioral science, which provides methodologies for analyzing how certain factors influence the behavior of subjects under study (Epling and Pierce, 1986).…”
Section: Introduction
confidence: 99%
“…in (Bentivogli et al., 2016). Recently, various new proposals have been put forward to better diagnose neural models, notably by Linzen et al. (2016) and Sennrich (2017), who focus respectively on the syntactic competence of Neural Language Models (NLMs) and of NMT, and by Isabelle et al. (2017) and Burchardt et al. (2017), who resuscitate an old tradition of designing test suites. Inspired by these (and other) works (see § 4), we propose in this paper a new evaluation scheme aimed specifically at assessing the morphological competence of MT engines translating from English into a Morphologically Rich Language (MRL).…”
Section: Introduction
confidence: 99%
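To make the test-suite idea concrete: a challenge set pairs each hand-designed source sentence with a check on the system's output. Isabelle et al. (2017) rely on human yes/no judgments; the sketch below substitutes an automated regex probe, with invented examples and a placeholder translate() function, purely to convey the mechanics.

import re

# Each entry: hand-designed English source and a regex probe over the
# French output. Both examples are invented; the regex probe replaces
# the human yes/no judgment used by Isabelle et al. (2017).
CHALLENGE_SET = [
    # Subject-verb agreement across a distractor noun ("keys ... are").
    ("The keys to the cabinet are on the table.", r"\bsont\b"),
    # Negation placement: French wraps the verb with "ne ... pas".
    ("He does not sing.", r"\bne\s+\w+\s+pas\b"),
]

def translate(source):
    """Placeholder: call the MT system under evaluation here."""
    raise NotImplementedError

def evaluate(system=translate):
    passed = 0
    for source, probe in CHALLENGE_SET:
        output = system(source)
        ok = re.search(probe, output.lower()) is not None
        passed += ok
        print(("PASS" if ok else "FAIL") + f": {source!r} -> {output!r}")
    print(f"{passed}/{len(CHALLENGE_SET)} challenges passed")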