Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.3115/v1/d14-1026

A Human Judgement Corpus and a Metric for Arabic MT Evaluation

Abstract: We present a human judgments dataset and an adapted metric for the evaluation of Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic with high annotation quality. We use the dataset to adapt the BLEU score for Arabic. Our score (AL-BLEU) provides partial credit for stem and morphological matches between hypothesis and reference words. We evaluate BLEU, METEOR and AL-BLEU on our human judgments corpus and show that AL-BLEU has the highest correlation with human judgments. We are re…
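The abstract sketches how AL-BLEU relaxes BLEU's binary token matching. As a rough illustration only (not the authors' released implementation), the snippet below scores a word pair with full credit for an exact match and fractional credit for stem or morphological-feature overlap; the `stem` and `morph_features` helpers and the weights are hypothetical stand-ins, since the paper tunes its own weights on the human-judgment data.

```python
# Minimal sketch of AL-BLEU-style partial credit (not the authors'
# implementation). `stem` and `morph_features` stand in for an Arabic
# morphological analyzer; the weights are illustrative placeholders.

def partial_match(hyp_word, ref_word, stem, morph_features,
                  w_stem=0.5, w_morph=0.1):
    """Score a hypothesis/reference word pair in [0, 1]."""
    if hyp_word == ref_word:
        return 1.0                                  # exact surface match
    score = 0.0
    if stem(hyp_word) == stem(ref_word):
        score += w_stem                             # shared stem
        shared = morph_features(hyp_word) & morph_features(ref_word)
        score += w_morph * len(shared)              # shared morphological features
    return min(score, 1.0)
```

In BLEU's modified n-gram precision, such fractional scores would replace the usual 0/1 match counts.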

Cited by 25 publications (30 citation statements); references 13 publications.
“…For example, Bouamor et al (2014) presents correlations for both standard BLEU and a modified version called AL-BLEU. I included the correlation with standard BLEU, but not AL-BLEU.…”
Section: Screening Papers
“…Some of the papers surveyed (as well as many of the papers I excluded) gave interesting qualitative analyses of cases when BLEU provides misleading results. For example, Bouamor et al (2014) explain BLEU's weaknesses in evaluating texts in morphologically rich languages such as Arabic, and Espinosa et al (2010) point out that BLEU inappropriately penalizes texts that have different adverbial placement compared with reference texts. These comments are interesting and valuable research contributions, but in this structured review my focus is on quantitative correlations between BLEU and human evaluations.…”
Section: Extracting Information From Papers
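To make the quantitative comparison these quotes refer to concrete, the snippet below correlates per-segment metric scores with human judgments via `scipy.stats`; all numbers are invented for illustration, not results from Bouamor et al. (2014).

```python
# Hypothetical illustration of metric-human correlation; the scores
# and ratings below are invented, not taken from the paper.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.31, 0.42, 0.18, 0.55, 0.27]  # e.g. per-segment BLEU
human_ratings = [3, 4, 1, 5, 2]                 # e.g. adequacy ranks

tau, _ = kendalltau(metric_scores, human_ratings)
r, _ = pearsonr(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```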
“…For instance, VATEX-zh has more nouns and verbs but fewer adjectives than VATEX-en, because the semantics of many Chinese adjectives are included in nouns or verbs [71]. [Hiring human transla]tors is costly and time-consuming. Thus, following previous methods [8,68] on collecting parallel pairs, we choose the post-editing annotation strategy. Particularly, for each video, we randomly sample 5 captions from the annotated 10 English captions and use multiple translation systems to translate them into Chinese reference sentences.…”
Section: Chinese Description Collection
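The sampling step described in this quote is easy to picture in code. A minimal sketch, where `captions_by_video` and `mt_systems` are hypothetical stand-ins for the VATEX data and the translation systems:

```python
# Hypothetical sketch of the post-editing pipeline's first step: sample
# 5 of the 10 English captions per video and draft-translate them for
# later human post-editing. `mt_systems` stands in for real MT engines.
import random

def sample_for_post_editing(captions_by_video, mt_systems, k=5, seed=0):
    rng = random.Random(seed)
    drafts = {}
    for vid, caps in captions_by_video.items():
        sampled = rng.sample(caps, k)        # 5 of the 10 captions
        # one draft per caption per system, to be post-edited by humans
        drafts[vid] = [mt(cap) for cap in sampled for mt in mt_systems]
    return drafts
```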
“…2019), which included translations of tourism-related texts. There have also been a number of other multi-dialectal corpora compiled for Arabic, including a parallel corpus of 2000 sentences in English, MSA, and multiple Arabic dialects (Bouamor, Habash, and Oflazer 2014); a corpus from web forums with data from eighteen Arabic-speaking countries (Sadat et al. 2014); as well as some multi-dialect corpora consisting of Twitter posts (Elgabou and Kazakov 2017; Alshutayri and Atwell 2017).…”
Section: Applications