“…Though all have strengths and weaknesses, ROUGE metrics (particularly ROUGE-L) are common for multisentence text evaluations. Textual metrics that consider specific qualities in the system outputs, like complexity and diversity, are also used to evaluate NLG systems (Dusek et al., 2019; Hashimoto et al., 2019; Sagarkar et al., 2018; Purdy et al., 2018). Word mover's distance has recently been used for NLP tasks like learning word embeddings (Zhang et al., 2017; Wu et al., 2018), textual entailment (Sulea, 2017), document similarity and classification (Kusner et al., 2015; Huang et al., 2016; Atasu et al., 2017), image captioning (Kilickaya et al., 2017), document retrieval (Balikas et al., 2018), clustering for semantic word-rank (Zhang and Wang, 2018), and as an additional loss for text generation that measures the optimal transport between the generated hypothesis and reference text (Chen et al., 2019).…”
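To make the ROUGE-L metric mentioned above concrete, the following is a minimal sketch of its sentence-level form, which scores a hypothesis by the longest common subsequence (LCS) it shares with a reference. It is not taken from the cited works: the function names `lcs_length` and `rouge_l`, the lowercasing, and the whitespace tokenization are illustrative assumptions, and the summary-level variant typically used for multisentence text (union LCS across sentences) is not covered here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis, beta=1.0):
    """Sentence-level ROUGE-L F-measure.

    Assumption: tokens are lowercased, whitespace-separated words.
    beta weights recall relative to precision (beta=1 gives the F1 form).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)      # share of the reference that is covered
    precision = lcs / len(hyp)   # share of the hypothesis that matches
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (recall + b2 * precision)


if __name__ == "__main__":
    print(rouge_l("the cat sat on the mat", "the cat is on the mat"))
```

Because the LCS only requires matching tokens to appear in the same order, not contiguously, this score rewards shared sentence-level word order without fixing an n-gram length in advance.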