Proceedings of the 22nd Conference on Computational Natural Language Learning 2018
DOI: 10.18653/v1/k18-1031

Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!

Abstract: Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for referenceless fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though word-overlap metrics like ROUGE are computed with the help of hand-written references, our referenceless methods obtai…
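For orientation, SLOR normalizes a language model's log-probability of a sentence S by the sentence's unigram log-probability and its length; in the paper's formulation,

\mathrm{SLOR}(S) = \frac{1}{|S|}\left(\log p_{\mathrm{LM}}(S) - \log p_{\mathrm{u}}(S)\right),
\qquad
p_{\mathrm{u}}(S) = \prod_{t \in S} p(t),

where |S| is the length of S in tokens and p(t) is the unigram probability of token t. WPSLOR applies the same normalization over WordPiece tokens, which permits a more compact language model.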

Cited by 50 publications (49 citation statements) | References 35 publications
“…Likability quantifies how much a set of one or more qualities makes a response more likable for a particular task. These qualities can be diversity (Li et al., 2016), sentiment (Rashkin et al., 2019), specificity (Ke et al., 2018), engagement (Yi et al., 2019), fluency (Kann et al., 2018), and more. A likable response may or may not be sensible to the context.…”
Section: Fundamental Aspects
confidence: 99%
“…Moreover, instead of using both (c, r), as in Mehri and Eskenazi (2020), we use only the response r to ensure independence from the context c. Therefore, for a response r with m words, we sequentially mask one word at a time and feed the result into BERT-MLM to compute the negative log-likelihood (MLM-Likelihood) of all masked words. We also investigate negative cross-entropy (MLM-NCE), perplexity (MLM-PPL), and MLM-SLOR (Kann et al., 2018) to verify whether they can be used for the understandability and specificity aspects.…”
Section: Metrics for Fundamental Aspects
confidence: 99%
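A minimal sketch of the masking procedure this statement describes, scoring one masked token at a time with a BERT masked LM (the model name, the Hugging Face transformers usage, and the averaging over tokens are illustrative assumptions; the cited work's exact setup may differ):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mlm_negative_log_likelihood(response: str) -> float:
    """Mask each token of the response in turn and average the negative
    log-likelihood BERT assigns to the original token (MLM-Likelihood idea)."""
    ids = tokenizer(response, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return sum(nlls) / len(nlls)  # lower = more likely under the MLM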
“…Unfortunately, it offers no way to evaluate dialogues without a specified ground truth. On another note, Kann et al. (2018) suggest a sentence-level fluency metric derived from the perplexity score of a language model for a given sentence, without involving any references. Their results demonstrate significant positive correlations with human annotators.…”
Section: Dialogue Evaluation
confidence: 99%
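As a concrete illustration of such a referenceless score, a minimal SLOR computation with an off-the-shelf causal LM (GPT-2 here is a stand-in assumption; Kann et al. (2018) trained their own LM, and the unigram log-probability would be estimated from corpus counts, supplied here by the caller):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # assumed stand-in LM
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> tuple[float, int]:
    """Total LM log-probability of the sentence, plus its token count."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        out = model(ids, labels=ids)
    n = ids.shape[1]
    # out.loss is the mean negative log-likelihood over n - 1 predictions.
    return -out.loss.item() * (n - 1), n

def slor(sentence: str, unigram_log_prob: float) -> float:
    """SLOR: length-normalized LM log-prob minus unigram log-prob.
    `unigram_log_prob` must be estimated from corpus unigram counts."""
    log_p, n = sentence_log_prob(sentence)
    return (log_p - unigram_log_prob) / n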
“…Since intuition dictates that responses depend on their preceding context, we condition the target reply on its history to measure its relevance. Kann et al. (2018) showed how language models can serve as good sentence-level fluency indicators. Thus, the probability calculated by the transformer-based LM can serve as a combined score for fluency and coherence.…”
Section: Language Model Evaluators
confidence: 99%
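A sketch of that conditioning idea under stated assumptions (GPT-2 as the transformer LM; plain string concatenation of history and reply; the cited work's exact model and input formatting may differ):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # assumed model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def conditional_log_prob(history: str, reply: str) -> float:
    """log p(reply | history): sum of the reply tokens' log-probabilities
    given the preceding dialogue history (fluency + coherence signal)."""
    ctx = tokenizer(history, return_tensors="pt")["input_ids"]
    full = tokenizer(history + " " + reply, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full).logits, dim=-1)
    total = 0.0
    # Assumes GPT-2's BPE keeps the history's token boundary intact when the
    # reply is appended with a leading space (generally, though not always, true).
    for pos in range(ctx.shape[1], full.shape[1]):
        total += log_probs[0, pos - 1, full[0, pos]].item()
    return total  # higher (less negative) = more fluent, context-coherent reply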
“…Heuristic-based evaluation was found to be effective for grammatical error correction, but the methods used were problem-specific and cannot be extended to other tasks (Napoles et al., 2016; Choshen and Abend, 2018; Asano et al., 2017). Using the log-odds from a language model, Kann et al. (2018) made automatic judgments of sentence-level fluency that correlated moderately well with human judgment, but this captured only one facet of language quality. Approaches that were broader in scope found less success: although ADEM, an RNN trained to score dialogue responses, was initially thought to correlate well with human judgment (Lowe et al., 2017), it was later found to generalize poorly, placing outsized influence on factors such as response length (Lowe, 2019).…”
Section: Introduction
confidence: 99%