Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.405

Lower Perplexity is Not Always Human-Like

Abstract: In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universals within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine …
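As an illustration of the quantity at issue (not the paper's actual experimental pipeline, which trains and compares language models on Japanese and English corpora), the sketch below computes per-token surprisal and sentence perplexity with an off-the-shelf GPT-2 model via the Hugging Face transformers library; the model name "gpt2" and the example sentence are placeholder assumptions.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholder model; the paper compares many language models, not just GPT-2.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence):
    """Surprisal (-log2 probability) of each token given its left context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                         # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # nats
    targets = ids[0, 1:]
    nll_nats = -log_probs[torch.arange(targets.numel()), targets]
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return tokens, nll_nats / math.log(2)                  # nats -> bits

def perplexity(sentence):
    """exp of the mean per-token negative log-likelihood (in nats)."""
    _, surprisal_bits = token_surprisals(sentence)
    return math.exp(surprisal_bits.mean().item() * math.log(2))

print(perplexity("The cat sat on the mat."))  # lower = more fluent to the LM

In the psycholinguistic setting the paper examines, per-token surprisals like these are regressed against reading-time measures; the paper's point is that a model with lower corpus perplexity does not necessarily give a better fit to human reading behavior.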

Cited by 29 publications (31 citation statements) · References 48 publications
“…3). called into question (Kuribayashi et al., 2021). As such, while we find convincing preliminary evidence in our analyzed languages, we are not able to fully test the hypothesis that the pressure for UID is at the language level.…”
Section: Discussion (contrasting)
confidence: 73%
“…Perplexity can evaluate the fluency of sentences, but it is still not capable of detecting semantic differences between sentences. Also, a recent study (Kuribayashi et al., 2021) shows that low perplexity does not directly correspond to a human-like sentence. Therefore, we should reconsider how to evaluate subtle textual differences, such as semantic shifts caused by an edit to the text.…”
Section: Related Work (mentioning)
confidence: 99%
“…A participant noted they "wouldn't trust any sort of automatic measure of a text generation system [as they need] more than just a good BLEU or ROUGE score before [they'd] sign off on using a language model" [P11], while others questioned whether automatic metrics "capture anything meaningful" [P13] when assessing latent constructs like creativity. Despite these and other documented shortcomings (Gkatzia and Mahamood, 2015; Novikova et al., 2017; Kuribayashi et al., 2021; Liang and Li, 2021), practitioners do rely broadly on automatic metrics: 50% of survey participants agree or strongly agree that automatic metrics represent reliable ways to assess NLG systems or models [SQ20], while 43% say that metrics developed for one NLG task can be reliably used or adapted to evaluate other NLG tasks (32% academic, 53% non-academic) [SQ22]. One participant remarked that "automatic metric[s are] still more scalable and objective than human evaluation" [SP].…”
Section: Rationales for Evaluation Practices (mentioning)
confidence: 99%