Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.405

Lower Perplexity is Not Always Human-Like

Abstract: In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universals within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine …
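As an illustration of the quantity at issue (not the paper's actual experimental pipeline, which trains and compares language models on Japanese and English corpora), the sketch below computes per-token surprisal and sentence perplexity with an off-the-shelf GPT-2 model via the Hugging Face transformers library; the model name "gpt2" and the example sentence are placeholder assumptions.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholder model; the paper compares many language models, not just GPT-2.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence):
    """Surprisal (-log2 probability) of each token given its left context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                         # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # nats
    targets = ids[0, 1:]
    nll_nats = -log_probs[torch.arange(targets.numel()), targets]
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return tokens, nll_nats / math.log(2)                  # nats -> bits

def perplexity(sentence):
    """exp of the mean per-token negative log-likelihood (in nats)."""
    _, surprisal_bits = token_surprisals(sentence)
    return math.exp(surprisal_bits.mean().item() * math.log(2))

print(perplexity("The cat sat on the mat."))  # lower = more fluent to the LM

In the psycholinguistic setting the paper examines, per-token surprisals like these are regressed against reading-time measures; the paper's point is that a model with lower corpus perplexity does not necessarily give a better fit to human reading behavior.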

Cited by 29 publications (31 citation statements) · References 48 publications
“…3). called into question (Kuribayashi et al., 2021). As such, while we find convincing preliminary evidence in our analyzed languages, we are not able to fully test the hypothesis that the pressure for UID is at the language level.…”
Section: Discussion (contrasting)
confidence: 73%
“…Perplexity can evaluate the fluency of sentences, but it is still not capable of detecting semantic differences between sentences. Also, a recent study (Kuribayashi et al., 2021) shows that low perplexity does not directly correspond to a human-like sentence. Therefore, we should reconsider how to evaluate subtle textual differences, such as semantic shifts caused by an edit to the text.…”
Section: Related Work (mentioning)
confidence: 99%
“…A participant noted they "wouldn't trust any sort of automatic measure of a text generation system [as they need] more than just a good BLEU or ROUGE score before [they'd] sign off on using a language model" [P11], while others questioned whether automatic metrics "capture anything meaningful" [P13] when assessing latent constructs like creativity. Despite these and other documented shortcomings (Gkatzia and Mahamood, 2015; Novikova et al., 2017; Kuribayashi et al., 2021; Liang and Li, 2021), practitioners do rely broadly on automatic metrics: 50% of survey participants agree or strongly agree that automatic metrics represent reliable ways to assess NLG systems or models [SQ20], while 43% say that metrics developed for one NLG task can be reliably used or adapted to evaluate other NLG tasks (32% academic, 53% non-academic) [SQ22]. One participant remarked that "automatic metric[s are] still more scalable and objective than human evaluation" [SP].…”
Section: Rationales for Evaluation Practices (mentioning)
confidence: 99%