With the growing number of summarization systems proposed in recent years, an automatic evaluation metric that can accurately and reliably rate the performance of these systems has become a pressing need. However, current automatic text evaluation metrics each measure only one or a few aspects of summary quality and do not agree consistently with human judgments. In this paper, we show that combining multiple well-chosen evaluation metrics and training predictive models on human-annotated datasets yields more reliable evaluation scores than any individual automatic metric. Our predictive models, trained on a human-annotated subset of the CNN/DailyMail corpus, demonstrate significant improvements (e.g., approximately 25% along the coherence dimension) over the selected individual metrics. Furthermore, we provide a concise meta-evaluation of automatic metrics along with an analysis of the performance of 12 predictive models, and we investigate the sensitivity of automatic metrics when they are mixed together for training these models. We have made the code, the instructions for experiment setup, and the trained models available as a tool for comparing and evaluating text summarization systems.
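The core idea described above, i.e. treating individual automatic metric scores as features and fitting a model to predict human judgments, can be illustrated with a minimal sketch. This is not the authors' implementation: the metric names, the synthetic data, and the choice of a simple linear regressor are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): learn a mapping from
# several automatic metric scores to a single human-like quality score.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Each row holds scores from individual automatic metrics for one summary,
# e.g. [ROUGE-L, BERTScore, METEOR] -- placeholder values for illustration.
metric_scores = rng.uniform(0.0, 1.0, size=(100, 3))

# Simulated human coherence annotations (1-5 scale) that correlate with a
# weighted mix of the metrics plus a small amount of noise.
human_scores = 1 + 4 * (0.5 * metric_scores[:, 0]
                        + 0.3 * metric_scores[:, 1]
                        + 0.2 * metric_scores[:, 2]) \
               + rng.normal(0, 0.05, size=100)

# Train a predictive model mapping metric scores -> human judgment.
model = LinearRegression().fit(metric_scores, human_scores)

def combined_score(scores):
    """Predict a human-like quality score from individual metric scores."""
    return float(model.predict(np.asarray(scores).reshape(1, -1))[0])
```

In practice the feature vector would come from running real metrics on system summaries, and the targets from annotations such as the coherence ratings mentioned above; any regressor (linear, tree-based, or neural) can play the role of the predictive model.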