Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.420

A Human Evaluation of AMR-to-English Generation Systems

Abstract: Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to tho…

Cited by 13 publications (16 citation statements). References 23 publications.
“…However, all of the aforementioned metrics return scores that are hardly interpretable and we cannot tell what exactly they have measured. These problems carry over to the evaluation of AMR-to-text generation: May and Priyadarshi (2017) find that BLEU does not well correspond to human ratings of generations from AMR, and Manning et al. (2020) show through human analysis that none of the existing automatic metrics can provide nuanced views on generation quality. Our proposal MFβ takes a first step to address these issues by aiming at a clear separation of form and meaning, as called for by Bender and Koller (2020).…”
Section: Related Work
“…This is currently not the case; new models are evaluated on different datasets, most of which focus only on the English language (Bender, 2019), and using these flawed metrics. Moreover, while human evaluations of generated texts can provide complementary insights to automatic evaluation (Manning et al., 2020), it can also lead to contradicting results since studies often omit crucial replication details and assume different definitions of the measured quantities.…”
Section: Introduction
“…Evaluation Measures As a primary metric, we evaluate generated text using BLEU (Papineni et al., 2002), calculated with SacreBLEU (Post, 2018). Despite its limitations in generation settings, BLEU still generally accords with rankings of models, either by human evaluations or by alternate metrics (Manning et al., 2020). We also evaluate our scaffolding models (§4) using BertScore, which measures token similarity with contextual embeddings, permitting a more nuanced measure of semantic similarity.…”
Section: Introduction
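
To make the metrics discussed in the citing passages concrete, here is a minimal sketch (assuming the Python packages sacrebleu and bert-score are installed) that scores a toy hypothesis against a reference with corpus-level BLEU and with BERTScore. The example sentences and variable names are illustrative only; they are not outputs of the evaluated AMR generation systems or the cited papers' pipelines.

```python
# Minimal sketch: scoring generated sentences against references with
# SacreBLEU (corpus-level BLEU) and BERTScore (contextual-embedding similarity).
# The hypothesis/reference strings below are toy examples, not system outputs.
import sacrebleu
from bert_score import score as bert_score

hypotheses = ["the boy wants the girl to believe him"]
references = ["the boy wants to be believed by the girl"]

# SacreBLEU takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore returns precision, recall, and F1 tensors, one value per sentence pair.
P, R, F1 = bert_score(hypotheses, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```

BLEU rewards surface n-gram overlap with the reference, while BERTScore compares contextual token embeddings, so a meaning-preserving paraphrase like the pair above can receive a low BLEU score but a relatively high BERTScore, which is one reason the cited work treats BLEU with caution for AMR-to-text evaluation.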