Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.420

A Human Evaluation of AMR-to-English Generation Systems

Abstract: Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to tho…

Cited by 13 publications (16 citation statements). References 23 publications.
“…However, all of the aforementioned metrics return scores that are hardly interpretable and we cannot tell what exactly they have measured. These problems carry over to the evaluation of AMR-to-text generation: May and Priyadarshi (2017) find that BLEU does not well correspond to human ratings of generations from AMR, and Manning et al. (2020) show through human analysis that none of the existing automatic metrics can provide nuanced views on generation quality. Our proposal MFβ takes a first step to address these issues by aiming at a clear separation of form and meaning, as called for by Bender and Koller (2020).…”
Section: Related Work
“…This is currently not the case; new models are evaluated on different datasets, most of which focus only on the English language (Bender, 2019), and using these flawed metrics. Moreover, while human evaluations of generated texts can provide complementary insights to automatic evaluation (Manning et al., 2020), it can also lead to contradicting results since studies often omit crucial replication details and assume different definitions of the measured quantities.…”
Section: Introduction
“…Evaluation Measures As a primary metric, we evaluate generated text using BLEU (Papineni et al., 2002), calculated with SacreBLEU (Post, 2018). Despite its limitations in generation settings, BLEU still generally accords with rankings of models, either by human evaluations or by alternate metrics (Manning et al., 2020). We also evaluate our scaffolding models (§4) using BertScore, which measures token similarity with contextual embeddings, permitting a more nuanced measure of semantic similarity.…”
Section: Introduction
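
To make the metrics discussed in the citing passages concrete, here is a minimal sketch (assuming the Python packages sacrebleu and bert-score are installed) that scores a toy hypothesis against a reference with corpus-level BLEU and with BERTScore. The example sentences and variable names are illustrative only; they are not outputs of the evaluated AMR generation systems or the cited papers' pipelines.

```python
# Minimal sketch: scoring generated sentences against references with
# SacreBLEU (corpus-level BLEU) and BERTScore (contextual-embedding similarity).
# The hypothesis/reference strings below are toy examples, not system outputs.
import sacrebleu
from bert_score import score as bert_score

hypotheses = ["the boy wants the girl to believe him"]
references = ["the boy wants to be believed by the girl"]

# SacreBLEU takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore returns precision, recall, and F1 tensors, one value per sentence pair.
P, R, F1 = bert_score(hypotheses, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```

BLEU rewards surface n-gram overlap with the reference, while BERTScore compares contextual token embeddings, so a meaning-preserving paraphrase like the pair above can receive a low BLEU score but a relatively high BERTScore, which is one reason the cited work treats BLEU with caution for AMR-to-text evaluation.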