Proceedings of the 22nd Conference on Computational Natural Language Learning 2018
DOI: 10.18653/v1/k18-1031

Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!

Abstract: Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for referenceless fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though word-overlap metrics like ROUGE are computed with the help of hand-written references, our referenceless methods obtai…
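For orientation, SLOR normalizes a language model's log-probability of a sentence S by the sentence's unigram log-probability and its length; in the paper's formulation,

\mathrm{SLOR}(S) = \frac{1}{|S|}\left(\log p_{\mathrm{LM}}(S) - \log p_{\mathrm{u}}(S)\right),
\qquad
p_{\mathrm{u}}(S) = \prod_{t \in S} p(t),

where |S| is the length of S in tokens and p(t) is the unigram probability of token t. WPSLOR applies the same normalization over WordPiece tokens, which permits a more compact language model.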

Cited by 50 publications (49 citation statements) | References 35 publications
“…Likability quantifies how much a set of one or more qualities makes a response more likable for a particular task. These qualities can be diversity (Li et al., 2016), sentiment (Rashkin et al., 2019), specificity (Ke et al., 2018), engagement (Yi et al., 2019), fluency (Kann et al., 2018), and more. A likable response may or may not be sensible to the context.…”
Section: Fundamental Aspects
confidence: 99%
“…Moreover, instead of using both (c, r), as in Mehri and Eskenazi (2020), we use only the response r to ensure independence from the context c. Therefore, for a response r with m words, we sequentially mask one word at a time and feed the result into BERT-MLM to compute the negative log-likelihood (MLM-Likelihood) of all masked words. We also investigate negative cross-entropy (MLM-NCE), perplexity (MLM-PPL), and MLM-SLOR (Kann et al., 2018) to verify whether they can be used for the understandability and specificity aspects.…”
Section: Metrics for Fundamental Aspects
confidence: 99%
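A minimal sketch of the masking procedure this statement describes, scoring one masked token at a time with a BERT masked LM (the model name, the Hugging Face transformers usage, and the averaging over tokens are illustrative assumptions; the cited work's exact setup may differ):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mlm_negative_log_likelihood(response: str) -> float:
    """Mask each token of the response in turn and average the negative
    log-likelihood BERT assigns to the original token (MLM-Likelihood idea)."""
    ids = tokenizer(response, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return sum(nlls) / len(nlls)  # lower = more likely under the MLM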
“…Unfortunately, it offers no way to evaluate dialogues without a specified ground truth. On another note, Kann et al. (2018) suggest a sentence-level fluency metric derived from the perplexity score of a language model for a given sentence, without involving any references. Their results demonstrate significant positive correlations with human annotators.…”
Section: Dialogue Evaluation
confidence: 99%
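As a concrete illustration of such a referenceless score, a minimal SLOR computation with an off-the-shelf causal LM (GPT-2 here is a stand-in assumption; Kann et al. (2018) trained their own LM, and the unigram log-probability would be estimated from corpus counts, supplied here by the caller):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # assumed stand-in LM
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> tuple[float, int]:
    """Total LM log-probability of the sentence, plus its token count."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        out = model(ids, labels=ids)
    n = ids.shape[1]
    # out.loss is the mean negative log-likelihood over n - 1 predictions.
    return -out.loss.item() * (n - 1), n

def slor(sentence: str, unigram_log_prob: float) -> float:
    """SLOR: length-normalized LM log-prob minus unigram log-prob.
    `unigram_log_prob` must be estimated from corpus unigram counts."""
    log_p, n = sentence_log_prob(sentence)
    return (log_p - unigram_log_prob) / n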
“…Since intuition dictates that responses depend on their preceding context, we condition the target reply on its history to measure its relevance. Kann et al. (2018) showed how language models can serve as good sentence-level fluency indicators. Thus, the probability calculated by the transformer-based LM can serve as a combined score for fluency and coherence.…”
Section: Language Model Evaluators
confidence: 99%
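A sketch of that conditioning idea under stated assumptions (GPT-2 as the transformer LM; plain string concatenation of history and reply; the cited work's exact model and input formatting may differ):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # assumed model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def conditional_log_prob(history: str, reply: str) -> float:
    """log p(reply | history): sum of the reply tokens' log-probabilities
    given the preceding dialogue history (fluency + coherence signal)."""
    ctx = tokenizer(history, return_tensors="pt")["input_ids"]
    full = tokenizer(history + " " + reply, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full).logits, dim=-1)
    total = 0.0
    # Assumes GPT-2's BPE keeps the history's token boundary intact when the
    # reply is appended with a leading space (generally, though not always, true).
    for pos in range(ctx.shape[1], full.shape[1]):
        total += log_probs[0, pos - 1, full[0, pos]].item()
    return total  # higher (less negative) = more fluent, context-coherent reply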
“…Heuristic-based evaluation was found to be effective for grammatical error correction, but the methods used were problem-specific and cannot be extended to other tasks (Napoles et al., 2016; Choshen and Abend, 2018; Asano et al., 2017). Using the log-odds from a language model, Kann et al. (2018) made automatic judgments of sentence-level fluency that correlated moderately well with human judgment, but this captured only one facet of language quality. Approaches that were broader in scope found less success: although ADEM, an RNN trained to score dialogue responses, was initially thought to correlate well with human judgment (Lowe et al., 2017), it was later found to generalize poorly, placing outsized influence on factors such as response length (Lowe, 2019).…”
Section: Introduction
confidence: 99%