“…Existing work often limits model comparisons to only a few baselines and offers human evaluations which are largely inconsistent with prior work. Additionally, despite problems associated with ROUGE when used outside of its original setting (Liu and Liu, 2008; Cohan and Goharian, 2016), as well as the introduction of many variations on ROUGE (Zhou et al., 2006; Ng and Abrecht, 2015; Ganesan, 2015; ShafieiBavani et al., 2018) and other text generation metrics (Peyrard, 2019; Zhao et al., 2019; Zhang et al., 2020; Scialom et al., 2019; Clark et al., 2019), ROUGE has remained the default automatic evaluation metric. We believe that the shortcomings of the current evaluation protocol are partially caused by the lack of easy-to-use resources for evaluation, both in the form of simplified evaluation toolkits and large collections of model outputs.…”