MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

Zhao, Wei; Peyrard, Maxime; Liu, Fei; Gao, Yang; Meyer, Christian M.; Eger, Steffen

doi:10.48550/arxiv.1909.02622

Cited by 33 publications

(51 citation statements)

References 38 publications


“…This suggests that the sketching step helps generate a more fluent summary even with lower unigram matching. Furthermore, recognizing the limitation of ROUGE scores in their ability to fully capture the resemblance between the generated summary and the reference, in Table 2, we follow (Fabbri et al, 2020) […] metrics, including ROUGE-Word Embedding (Ng and Abrecht, 2015), BERTScore (Zhang et al, 2019b), MoverScore (Zhao et al, 2019), Sentence Mover's Similarity (SMS) (Clark et al, 2019), BLEU (Papineni et al, 2002), and CIDEr (Vedantam et al, 2015). As shown in Table 2, CODS consistently outperforms PEGASUS and BART.…”

Section: Results (mentioning)

confidence: 94%

Controllable Abstractive Dialogue Summarization with Sketch Supervision

Liu et al. 2021

Preprint


In this paper, we aim to improve abstractive dialogue summarization quality and, at the same time, enable granularity control. Our model has two primary components and stages: 1) a two-stage generation strategy that generates a preliminary summary sketch serving as the basis for the final summary. This summary sketch provides a weakly supervised signal in the form of pseudo-labeled interrogative pronoun categories and key phrases extracted using a constituency parser. 2) A simple strategy to control the granularity of the final summary, in that our model can automatically determine or control the number of generated summary sentences for a given dialogue by predicting and highlighting different text spans from the source text. Our model achieves state-of-the-art performance on the largest dialogue summarization corpus SAMSum, with as high as 50.79 in ROUGE-L score. In addition, we conduct a case study and show competitive human evaluation results and controllability to human-annotated summaries. * Equal contribution. Work mainly done when Linqing Liu was an intern at Salesforce Research.


“…Baselines include BLEURT (described in Section 2.3), along with BERTScore, a non-learned neural metric that uses a matching algorithm on top of neural word embeddings, similar to n-gram matching approaches. MoverScore [42] is similar to BERTScore, but uses an optimal transport algorithm. BLEU, ROUGE, METEOR and chrF++ are widely used n-gram-based methods, working at the word, subword or character level.…”

Section: Results (mentioning)

confidence: 99%

Generative Pretraining for Paraphrase Evaluation

Weston, Lenain, Meepegama et al. 2021

Preprint


We introduce ParaBLEU, a paraphrase representation learning model and evaluation metric for text generation. Unlike previous approaches, ParaBLEU learns to understand paraphrasis using generative conditioning as a pretraining objective. ParaBLEU correlates more strongly with human judgements than existing metrics, obtaining new state-of-the-art results on the 2017 WMT Metrics Shared Task. We show that our model is robust to data scarcity, exceeding previous state-of-the-art performance using only 50% of the available training data and surpassing BLEU, ROUGE and METEOR with only 40 labelled examples. Finally, we demonstrate that ParaBLEU can be used to conditionally generate novel paraphrases from a single demonstration, which we use to confirm our hypothesis that it learns abstract, generalized paraphrase representations.
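
The citation statement for this entry contrasts BERTScore's matching step with MoverScore's use of optimal transport. The sketch below is only a toy illustration of that contrast, not either metric's actual implementation: the 2-D vectors stand in for contextual embeddings, the function names `greedy_cost` and `transport_cost` are hypothetical, and Euclidean distance with uniform weights is an assumption made for simplicity (the real metrics differ in details such as the similarity function and token weighting).

```python
from itertools import permutations
from math import dist


def greedy_cost(hyp, ref):
    """Average distance when every hypothesis vector independently
    picks its nearest reference vector (many-to-one is allowed)."""
    return sum(min(dist(h, r) for r in ref) for h in hyp) / len(hyp)


def transport_cost(hyp, ref):
    """Earth mover's distance for equal-sized sets with uniform weights.

    Under these assumptions an optimal transport plan can be taken to be
    a one-to-one assignment, so brute force over permutations is enough
    for tiny inputs (it is factorial in the set size).
    """
    if len(hyp) != len(ref):
        raise ValueError("this toy version needs equal-sized sets")
    n = len(hyp)
    best = min(
        sum(dist(h, ref[j]) for h, j in zip(hyp, perm))
        for perm in permutations(range(n))
    )
    return best / n


# Hypothetical toy "embeddings": two hypothesis tokens, two reference tokens.
hyp = [(0.0, 0.0), (0.0, 1.0)]
ref = [(0.0, 0.0), (3.0, 0.0)]

greedy = greedy_cost(hyp, ref)
transport = transport_cost(hyp, ref)
print(f"greedy matching cost:   {greedy:.3f}")
print(f"optimal transport cost: {transport:.3f}")
```

In this toy case both hypothesis vectors may pick the same nearby reference vector under greedy matching, so the distant reference vector is never accounted for. The transport formulation must move mass to every reference vector, which is why its cost is higher here.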


“…We use automatic metrics BLEU (Papineni et al, 2002), METEOR (Denkowski and Lavie, 2014) and a neural-based metric MoverScore (Zhao et al, 2019). As automatic scores remain tricky for correctly evaluating the text quality, we conduct human evaluation.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%

SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation

Chen, Takamura, Nakayama 2021

Preprint


Generating texts in scientific papers requires not only capturing the content contained within the given input but also frequently acquiring the external information called context. We push forward scientific text generation by proposing a new task, namely context-aware text generation in the scientific domain, aiming at exploiting the contributions of context in generated texts. To this end, we present a novel, challenging, large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely-used objects (e.g., tables, figures, algorithms) in a paper. We comprehensively benchmark, using state-of-the-art models, the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available to hopefully facilitate scientific text generation research.
