2017
DOI: 10.1515/pralin-2017-0014

Fine-Grained Human Evaluation of Neural Versus Phrase-Based Machine Translation

Abstract: We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems' outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system …

Cited by 64 publications (44 citation statements)
References 10 publications
“…This paper builds upon our recent work on this topic (Klubička et al, 2017), which is here extended in a number of directions:…”
Section: Introduction (mentioning)
confidence: 99%
“…Error analysis of NMT systems has also been on the radar of the MT field. Several papers have carried out automatic (Bentivogli et al. 2016; Toral and Sánchez-Cartagena 2017) or human error annotation (Burchardt et al. 2017; Klubička et al. 2017; Popović 2017; Castilho et al. 2018) in order to compare phrase-based and neural approaches for different language pairs and domains. In this issue, Calixto and Liu present an extensive error analysis of several MT systems, including two text-only systems that fall into the PBSMT and NMT paradigms, and a set of multimodal NMT models which use not only text but also visual information extracted from images.…”
Section: Error Analysis (mentioning)
confidence: 99%
“…by Bentivogli et al. (2016); Toral Ruiz and Sánchez-Cartagena (2017); Costa-jussà (2017); Klubička et al. (2017). These works differ in the language pairs and in the error typology considered.…”
Section: Related Work: Evaluating Morphology (mentioning)
confidence: 99%
“…These works differ in the language pairs and in the error typology considered. Bentivogli et al. (2016) only recognizes three main error types which are automatically recognized based on aligning the hypotheses and references; for instance, a morphological error is detected when the word form is wrong, whereas the lemma is correct; this definition is also adopted in , and decomposed at the level of morphological features in ; (Klubička et al., 2017)…”
Section: Related Work: Evaluating Morphology (mentioning)
confidence: 99%