Daniel Deutsch scite author profile

Daniel Deutsch

5 Publications

103 Citation Statements Received

153 Citation Statements Given

How they've been cited

102

How they cite others

149

Affiliations

University of Pennsylvania

Publications

A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Deutsch, Dror, Roth (2021)

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, nor whether differences between two metrics’ correlations reflect a true difference or if it is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do so in some evaluation settings.
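As a rough illustration of the bootstrapping idea mentioned in this abstract, the sketch below computes a percentile bootstrap confidence interval for the Pearson correlation between metric scores and human scores. It is a simplified sketch under stated assumptions, not the paper's actual procedure: it resamples individual summaries only, and the function names, parameters, and toy scores are all hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns None when either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(metric_scores, human_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the metric/human correlation.

    Assumption: each summary is one independent sampling unit. The paper
    compares several resampling schemes; this sketch implements only the
    simplest one.
    """
    rng = random.Random(seed)
    n = len(metric_scores)
    stats = []
    for _ in range(n_resamples):
        # Draw n summaries with replacement and recompute the correlation.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([metric_scores[i] for i in idx],
                    [human_scores[i] for i in idx])
        if r is not None:  # skip degenerate resamples with no variance
            stats.append(r)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical toy scores for eight summaries.
    metric = [0.31, 0.45, 0.28, 0.52, 0.40, 0.36, 0.60, 0.25]
    human = [2.0, 4.0, 2.5, 4.0, 3.0, 3.5, 5.0, 1.0]
    print(pearson(metric, human), bootstrap_ci(metric, human))
```

With only a handful of summaries the interval is typically wide, which is the kind of uncertainty the abstract describes.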

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

Deutsch, Bedrax-Weiss, Roth (2021)

A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference. Traditional text overlap based metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric to evaluate the content quality of a summary using question-answering (QA). QA-based methods directly measure a summary’s information overlap with a reference, making them fundamentally different than text overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, while being competitive on others due to limitations of state-of-the-art models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
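To illustrate why purely lexical token matching can miss differences in information, here is a toy unigram-overlap F1 score. It is a simplified stand-in written for this example, not the official ROUGE implementation and not QAEval; the function name and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Toy unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the company acquired the startup"
    candidate = "the startup acquired the company"  # roles are reversed
    print(unigram_f1(candidate, reference))
```

The two sentences contain the same tokens but state different facts, so a bag-of-tokens score cannot tell them apart. Real ROUGE variants also use longer n-grams and subsequences, which this sketch does not model.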

A Distributional and Orthographic Aggregation Model for English Derivational Morphology

Deutsch, Hewitt, Roth (2018)

Modeling derivational morphology to generate words with particular semantics is useful in many text generation tasks, such as machine translation or abstractive question answering. In this work, we tackle the task of derived word generation. That is, given the word "run," we attempt to generate the word "runner" for "someone who runs." We identify two key problems in generating derived words from root words and transformations: suffix ambiguity and orthographic irregularity. We contribute a novel aggregation model of derived word generation that learns derivational transformations both as orthographic functions using sequence-to-sequence models and as functions in distributional word embedding space. Our best open-vocabulary model, which can generate novel words, and our best closed-vocabulary model, show 22% and 37% relative error reductions over current state-of-the-art systems on the same dataset.

Understanding the Extent to which Content Quality Metrics Measure the Information Quality of Summaries

Deutsch, Roth (2021)

Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference. Ideally, this comparison should measure the summary's information quality by calculating how much information the summaries have in common. In this work, we analyze the token alignments used by ROUGE and BERTScore to compare summaries and argue that their scores largely cannot be interpreted as measuring information overlap. Rather, they are better estimates of the extent to which the summaries discuss the same topics. Further, we provide evidence that this result holds true for many other summarization evaluation metrics. The consequence of this result is that the most frequently used summarization evaluation metrics do not align with the community's research goal, to generate summaries with high-quality information. However, we conclude by demonstrating that a recently proposed metric, QAEval, which scores summaries using question-answering, appears to better capture information quality than current evaluations, highlighting a direction for future research.

SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics

Deutsch, Roth (2020)

We present SacreROUGE, an open-source library for using and developing summarization evaluation metrics. SacreROUGE removes many obstacles that researchers face when using or developing metrics: (1) The library provides Python wrappers around the official implementations of existing evaluation metrics so they share a common, easy-to-use interface; (2) it provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments, so no additional code needs to be written for a new evaluation metric; and (3) it includes scripts for loading datasets that contain human judgments so they can easily be used for evaluation. This work describes the design of the library, including the core Metric interface, the command-line API for evaluating summarization models and metrics, and the scripts to load and reformat publicly available datasets. The development of SacreROUGE is ongoing and open to contributions from the community.
