2009
DOI: 10.1007/s10590-009-9065-6
The NIST 2008 Metrics for machine translation challenge—overview, methodology, metrics, and results

Abstract: This paper discusses the evaluation of automated metrics developed for the purpose of evaluating machine translation (MT) technology. A general discussion of the usefulness of automated metrics is offered. The NIST MetricsMATR evaluation of MT metrology is described, including its objectives, protocols, participants, and test data. The methodology employed to evaluate the submitted metrics is reviewed. A summary is provided for the general classes of evaluated metrics. Overall results of this evaluation are pr…

Cited by 41 publications (26 citation statements)
References 21 publications
“…For example, Przybocki et al (2009) use, as part of their larger human evaluation, a single (7-point) scale (labeled "adequacy") to assess the quality of translations. Inter-annotator agreement for this method was κ = 0.25, even lower than the results for adequacy and fluency reported in WMT 2007 (noting that caution is required when directly comparing agreement measurements, especially over scales of varying granularity, such as 5-versus 7-point assessments).…”
Section: Past and Current Methodologies
confidence: 99%
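The κ = 0.25 figure above is a Cohen's kappa, a chance-corrected agreement statistic. A minimal sketch, assuming two annotators rate the same translations on a 7-point adequacy scale (the rating lists below are illustrative, not MetricsMATR data):

```python
# Minimal sketch: inter-annotator agreement on a 7-point adequacy scale,
# quantified with Cohen's kappa. The ratings are hypothetical examples.
from sklearn.metrics import cohen_kappa_score

annotator_a = [7, 5, 6, 3, 4, 2, 7, 5, 1, 6]  # hypothetical adequacy ratings
annotator_b = [6, 5, 4, 3, 5, 2, 7, 4, 2, 5]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```

Because kappa discounts the agreement expected by chance, values well below 1 are common on fine-grained scales, which is part of the caution raised above about comparing 5-point and 7-point results directly.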
See 1 more Smart Citation
“…For example, Przybocki et al (2009) use, as part of their larger human evaluation, a single (7-point) scale (labeled "adequacy") to assess the quality of translations. Inter-annotator agreement for this method was κ = 0.25, even lower than the results for adequacy and fluency reported in WMT 2007 (noting that caution is required when directly comparing agreement measurements, especially over scales of varying granularity, such as 5-versus 7-point assessments).…”
Section: Past and Current Methodologiesmentioning
confidence: 99%
“…However, given the extent to which accurate human assessment of translation quality is fundamental to empirical MT, the underlying topic of finding ways of increasing the reliability of those assessments to date has received surprisingly little attention (Callison-Burch et al., 2007, 2008; Przybocki, Peterson, Bronsart, and Sanders, 2009, 2010; Denkowski and Lavie, 2010).…”
Section: Validation Of Automatic Metrics
confidence: 99%
“…This assumption is also true of most of the 39 automated measures submitted to the NIST 2008 Metrics for Machine Translation Challenge (Przybocki et al 2009). Measures based on exact matching of system outputs to references, including the Word Error Rate (WER) measure used to score automatic speech recognition (ASR), are at a disadvantage when applied to data that contains much variation which is unrelated to translation quality.…”
Section: Challenges For Automated Metrics
confidence: 95%
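As a minimal sketch of the exact-matching disadvantage noted above, Word Error Rate is the word-level edit distance between a system output and a single reference, normalized by reference length. The example strings below are illustrative:

```python
# Minimal sketch of Word Error Rate (WER): word-level Levenshtein distance
# between hypothesis and reference, divided by reference length.
def word_error_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on a mat",
                      "the cat sat on the mat"))  # 0.1667: one substitution in six words
```

Because only exact token matches count, a legitimate synonym or valid reordering raises WER just as a genuine error would, which is the variation-unrelated-to-quality problem the excerpt describes.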
“…Communication patterns in this domain have not been studied in significant detail either. The metrics used to assess MT quality in competitive evaluations (Przybocki et al 2009) and the industry (Roturier 2009) also appear to overlook the collaborative nature of the task.…”
Section: Supporting Collaboration Between Translators and Across Sites
confidence: 99%
“…930-933). Automatic metrics are tested in terms of their correlation with such judgements (Przybocki et al 2009). …”
Section: Introduction
confidence: 99%
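A minimal sketch of that validation step, assuming a list of human judgements and the corresponding scores from some automatic metric over the same translations (both lists below are illustrative):

```python
# Minimal sketch: validating an automatic MT metric by its correlation
# with human judgements. The score lists are hypothetical examples.
from scipy.stats import pearsonr, spearmanr

human_scores  = [4.5, 3.0, 2.5, 5.0, 1.5, 3.5]        # hypothetical adequacy judgements
metric_scores = [0.62, 0.41, 0.38, 0.70, 0.25, 0.50]  # hypothetical metric outputs

print("Pearson r: ", pearsonr(human_scores, metric_scores)[0])
print("Spearman rho:", spearmanr(human_scores, metric_scores)[0])
```

Pearson correlation measures the linear relationship between the two score sets, while Spearman only requires the metric to rank translations in the same order as the human judges; evaluations such as MetricsMATR report correlations of this kind at the segment, document, and system level.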