Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1060

The price of debiasing automatic metrics in natural language evaluation

Abstract: For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7…
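The control-variates idea in the abstract can be made concrete: human judgments on a small sample are combined with automatic-metric scores, whose mean can be computed cheaply over the full dataset, so that the variance of the estimate shrinks with the metric–human correlation. The Python below is only a minimal sketch of this general construction, with an assumed function name and toy data; it is not the authors' implementation.

```python
import numpy as np

def control_variate_estimate(human_scores, metric_scores, metric_mean_all):
    """Estimate the mean human judgment using the automatic metric as a
    control variate.

    human_scores    -- human judgments on the n sampled examples
    metric_scores   -- automatic-metric scores on the same n examples
    metric_mean_all -- metric mean over the full (cheap-to-score) dataset
    """
    human_scores = np.asarray(human_scores, dtype=float)
    metric_scores = np.asarray(metric_scores, dtype=float)

    # Coefficient that minimizes variance: cov(human, metric) / var(metric).
    cov = np.cov(human_scores, metric_scores, ddof=1)
    alpha = cov[0, 1] / cov[1, 1]

    # Plain sample mean, corrected by how far the metric's sample mean drifts
    # from its dataset-wide mean. (Estimating alpha from the same sample adds
    # a small finite-sample bias; this is only an illustration.)
    return human_scores.mean() - alpha * (metric_scores.mean() - metric_mean_all)

# Toy usage with synthetic, weakly correlated scores.
rng = np.random.default_rng(0)
human = rng.normal(0.6, 0.1, size=500)
metric = 0.5 * human + rng.normal(0.0, 0.1, size=500)
print(control_variate_estimate(human[:50], metric[:50], metric.mean()))
```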

Cited by 95 publications (96 citation statements); References 26 publications
“…Results reveal that the lexical metric SENTBLEU can correctly assign lower scores to system translations of low quality, while it struggles in judging system translations of high quality by assigning them lower scores. Our finding agrees with the observations in Chaganty et al. (2018) and Novikova et al. (2017): lexical metrics correlate better with human judgments on texts of low quality than on texts of high quality. Peyrard (2019b) further shows that lexical metrics cannot be trusted because … [Figure 3 caption: correlation in a similar-language (de-en) and a distant-language (zh-en) pair, where the bordered area shows correlations between human assessment and metrics, the rest shows inter-correlations across metrics, and DA is direct assessment rated by language experts.]…”
Section: Further Analysis (supporting)
confidence: 92%
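The pattern described in this excerpt — lexical metrics tracking human judgment better on low-quality than on high-quality outputs — can be checked for any system by splitting segments on human score and correlating within each half. The snippet below is a hypothetical illustration with synthetic scores and an invented helper name, not data or code from the cited papers.

```python
import numpy as np

def split_correlations(metric_scores, human_scores):
    """Pearson correlation of metric vs. human scores, computed separately on
    the lower and upper halves of segments ranked by human score."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    order = np.argsort(human_scores)
    half = len(order) // 2
    low, high = order[:half], order[half:]
    r_low = np.corrcoef(metric_scores[low], human_scores[low])[0, 1]
    r_high = np.corrcoef(metric_scores[high], human_scores[high])[0, 1]
    return r_low, r_high

# Synthetic example: the metric is noisier on the better half of the outputs.
rng = np.random.default_rng(1)
human = rng.uniform(0, 100, size=400)
noise = np.where(human < 50, rng.normal(0, 3, 400), rng.normal(0, 25, 400))
print(split_correlations(human + noise, human))
```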
“…METEOR, in contrast, takes synonymy into account, and our methods outperformed previous systems on this metric. Our observation follows recently published work on evaluating abstractive NLI systems (Chaganty et al., 2018). Concurrently with improving NLI methodology, it is worth investing in the development of evaluation methods that reflect progress faithfully.…”
Section: Results (supporting)
confidence: 72%
“…We also find that automatic evaluation scores like BLEU and METEOR, which rely on word overlap, are overly conservative regarding the output of our model. A series of recent papers has discussed problems with comparing models on abstractive NLI tasks using automatic metrics such as the ones listed above (Novikova et al., 2017; Chaganty et al., 2018). While there is decent agreement between human and automatic judgments on bad model outputs, disagreements tend to be substantial on good outputs.…”
Section: Error Analysis (mentioning)
confidence: 99%
“…Sentence Level Discrimination. Chaganty et al. (2018) point out that hill-climbing on an automatic metric is meaningless if that metric has a low instance-level correlation to human judgments. In Table 3 we show the average accuracy of the metrics in making the same judgments as humans between pairs of generated texts.…”
Section: Discussion (mentioning)
confidence: 99%
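The instance-level correlation mentioned in this excerpt is often reported as pairwise accuracy: the fraction of output pairs on which the metric prefers the same output as the human judges. A minimal sketch with hypothetical scores and an invented helper name follows; it is not the evaluation code of the cited work.

```python
from itertools import combinations

def pairwise_agreement(human_scores, metric_scores):
    """Fraction of output pairs on which the metric ranks the two outputs the
    same way as the human judgments (pairs tied on either score are skipped)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        m = metric_scores[i] - metric_scores[j]
        if h == 0 or m == 0:
            continue  # skip ties, which carry no preference
        total += 1
        agree += (h > 0) == (m > 0)
    return agree / total if total else float("nan")

# Toy example: four outputs with hypothetical human and metric scores.
print(pairwise_agreement([3, 5, 2, 4], [0.31, 0.40, 0.35, 0.38]))
```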