Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.3115/v1/d14-1026

A Human Judgement Corpus and a Metric for Arabic MT Evaluation

Abstract: We present a human judgments dataset and an adapted metric for the evaluation of Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic with high annotation quality. We use the dataset to adapt the BLEU score for Arabic. Our score (AL-BLEU) provides partial credit for stem and morphological matches between hypothesis and reference words. We evaluate BLEU, METEOR and AL-BLEU on our human judgments corpus and show that AL-BLEU has the highest correlation with human judgments. We are re…
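The abstract sketches how AL-BLEU relaxes BLEU's binary token matching. As a rough illustration only (not the authors' released implementation), the snippet below scores a word pair with full credit for an exact match and fractional credit for stem or morphological-feature overlap; the `stem` and `morph_features` helpers and the weights are hypothetical stand-ins, since the paper tunes its own weights on the human-judgment data.

```python
# Minimal sketch of AL-BLEU-style partial credit (not the authors'
# implementation). `stem` and `morph_features` stand in for an Arabic
# morphological analyzer; the weights are illustrative placeholders.

def partial_match(hyp_word, ref_word, stem, morph_features,
                  w_stem=0.5, w_morph=0.1):
    """Score a hypothesis/reference word pair in [0, 1]."""
    if hyp_word == ref_word:
        return 1.0                                  # exact surface match
    score = 0.0
    if stem(hyp_word) == stem(ref_word):
        score += w_stem                             # shared stem
        shared = morph_features(hyp_word) & morph_features(ref_word)
        score += w_morph * len(shared)              # shared morphological features
    return min(score, 1.0)
```

In BLEU's modified n-gram precision, such fractional scores would replace the usual 0/1 match counts.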

Cited by 25 publications (30 citation statements); references 13 publications.
“…For example, Bouamor et al (2014) presents correlations for both standard BLEU and a modified version called AL-BLEU. I included the correlation with standard BLEU, but not AL-BLEU.…”
Section: Screening Papers
“…Some of the papers surveyed (as well as many of the papers I excluded) gave interesting qualitative analyses of cases when BLEU provides misleading results. For example, Bouamor et al (2014) explain BLEU's weaknesses in evaluating texts in morphologically rich languages such as Arabic, and Espinosa et al (2010) point out that BLEU inappropriately penalizes texts that have different adverbial placement compared with reference texts. These comments are interesting and valuable research contributions, but in this structured review my focus is on quantitative correlations between BLEU and human evaluations.…”
Section: Extracting Information From Papers
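To make the quantitative comparison these quotes refer to concrete, the snippet below correlates per-segment metric scores with human judgments via `scipy.stats`; all numbers are invented for illustration, not results from Bouamor et al. (2014).

```python
# Hypothetical illustration of metric-human correlation; the scores
# and ratings below are invented, not taken from the paper.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.31, 0.42, 0.18, 0.55, 0.27]  # e.g. per-segment BLEU
human_ratings = [3, 4, 1, 5, 2]                 # e.g. adequacy ranks

tau, _ = kendalltau(metric_scores, human_ratings)
r, _ = pearsonr(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```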
“…For instance, VATEX-zh has more nouns and verbs but fewer adjectives than VATEX-en, because the semantics of many Chinese adjectives are included in nouns or verbs [71]. [Hiring human transla]tors is costly and time-consuming. Thus, following previous methods [8,68] on collecting parallel pairs, we choose the post-editing annotation strategy. Particularly, for each video, we randomly sample 5 captions from the annotated 10 English captions and use multiple translation systems to translate them into Chinese reference sentences.…”
Section: Chinese Description Collection
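The sampling step described in this quote is easy to picture in code. A minimal sketch, where `captions_by_video` and `mt_systems` are hypothetical stand-ins for the VATEX data and the translation systems:

```python
# Hypothetical sketch of the post-editing pipeline's first step: sample
# 5 of the 10 English captions per video and draft-translate them for
# later human post-editing. `mt_systems` stands in for real MT engines.
import random

def sample_for_post_editing(captions_by_video, mt_systems, k=5, seed=0):
    rng = random.Random(seed)
    drafts = {}
    for vid, caps in captions_by_video.items():
        sampled = rng.sample(caps, k)        # 5 of the 10 captions
        # one draft per caption per system, to be post-edited by humans
        drafts[vid] = [mt(cap) for cap in sampled for mt in mt_systems]
    return drafts
```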
“…2019), which included translations of tourism-related texts. There have also been a number of other multi-dialectal corpora compiled for Arabic, including a parallel corpus of 2000 sentences in English, MSA, and multiple Arabic dialects (Bouamor, Habash, and Oflazer 2014); a corpus from web forums with data from eighteen Arabic-speaking countries (Sadat et al. 2014); as well as some multi-dialect corpora consisting of Twitter posts (Elgabou and Kazakov 2017; Alshutayri and Atwell 2017).…”
Section: Applications