“…In text summarization, manual evaluation, as exemplified by the Pyramid method (Nenkova and Passonneau, 2004), is the gold standard. However, due to the time required and the relatively high cost of annotation, the great majority of research papers on summarization rely exclusively on automatic evaluation metrics such as ROUGE (Lin, 2004), JS-2 (Louis and Nenkova, 2013), S3 (Peyrard et al., 2017), BERTScore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). Among these metrics, ROUGE is by far the most popular, yet there is relatively little discussion of how ROUGE may deviate from human judgment, or of the potential for this deviation to change conclusions drawn about the relative merits of baseline and proposed methods.…”
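To make the notion of metric–human deviation concrete, the sketch below scores a few summaries with ROUGE-2 (using Google's `rouge-score` package) and measures rank correlation against human ratings with Kendall's tau. The reference, the system summaries, the ratings, and the choice of correlation statistic are all illustrative assumptions for this sketch, not data or a protocol from the paper.

```python
# A minimal sketch of comparing ROUGE against human judgments.
# Requires `pip install rouge-score scipy`. All texts and ratings
# below are hypothetical placeholders, not data from the paper.
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

reference = "the cat sat on the mat and watched the birds outside"
system_summaries = [
    "the cat sat on the mat watching birds",            # near-extractive
    "a cat observed birds while sitting on a mat",      # good paraphrase
    "the dog barked at the mailman all afternoon",      # off-topic
]
# Hypothetical human quality ratings for the same summaries (higher = better).
human_ratings = [4.5, 4.0, 1.0]

# Score each summary against the reference with ROUGE-2 F1.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
rouge_f1 = [
    scorer.score(reference, summary)["rouge2"].fmeasure
    for summary in system_summaries
]

# Rank correlation between the metric and human judgment: values well
# below 1.0 indicate the kind of deviation discussed above.
tau, p_value = kendalltau(rouge_f1, human_ratings)
print(f"ROUGE-2 F1 scores: {rouge_f1}")
print(f"Kendall's tau vs. human ratings: {tau:.3f} (p={p_value:.3f})")
```

Note that the deliberately chosen paraphrase receives a ROUGE-2 score near zero despite its high human rating, since n-gram overlap rewards lexical matching rather than meaning; this is one simple mechanism by which ROUGE can diverge from human judgment and reorder system comparisons.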