Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.414

Language Model Evaluation Beyond Perplexity

Abstract: We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of language models to these trends. We find that neural language models appear to learn …
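As a concrete illustration of the framework the abstract describes, the sketch below compares one statistical tendency (sentence length, chosen here as an example) between human and model-generated text and attaches a significance test. The two-sample Kolmogorov-Smirnov test and the length_fit helper are illustrative assumptions, not the paper's actual test statistics.

```python
# Minimal sketch of the evaluation idea in the abstract: compare one
# statistical tendency of model-generated text against the human training
# text, with a significance test attached. The KS test is an illustrative
# stand-in; the paper's own framework and test statistics may differ.
from scipy.stats import ks_2samp

def length_fit(human_sentences, model_sentences, alpha=0.05):
    """Test whether model text matches the human sentence-length distribution."""
    human_lengths = [len(s.split()) for s in human_sentences]
    model_lengths = [len(s.split()) for s in model_sentences]
    stat, p_value = ks_2samp(human_lengths, model_lengths)
    # A small p-value means the two length distributions are distinguishable,
    # i.e. the model has not matched this tendency of its training text.
    return {"ks_statistic": stat, "p_value": p_value,
            "matches": p_value >= alpha}
```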

Cited by 25 publications (15 citation statements)
References: 34 publications
“…As neural networks yield state-of-the-art performance in language modeling tasks, we expect them to also do well with the unigram distribution. In fact, pseudo-text generated by LSTM-based language models reproduces Zipf's law to some extent (Takahashi and Tanaka-Ishii, 2017; Meister and Cotterell, 2021). Thus, we view state-of-the-art LSTM models as a strong baseline.…”
Section: Modeling the Unigram Distribution
Confidence: 99%
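The Zipf's law check this statement refers to can be sketched as follows: estimate the exponent of the rank-frequency curve of generated text and compare it against the exponent on human text. The log-log least-squares fit and the zipf_exponent helper are illustrative assumptions; the cited works use more careful estimation.

```python
# Rough check of whether pseudo-text follows Zipf's law, i.e. unigram
# frequency proportional to 1/rank^s. A least-squares fit in log-log space
# is a common but crude estimator of the exponent s.
from collections import Counter
import numpy as np

def zipf_exponent(tokens):
    """Estimate the Zipf exponent s from a log-log rank-frequency regression."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    # Zipf's law predicts s close to 1 for natural language; comparing the
    # estimate on generated vs. human text measures how well it is reproduced.
    return -slope
```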
“…In this section, we examine the extent to which higher-order statistics of sentences from BERT's prior are well-calibrated to the data it was trained on. This kind of comparison provides a richer sense of what the model has learned or failed to learn than traditional scalar metrics like perplexity (Takahashi and Tanaka-Ishii, 2017; Meister and Cotterell, 2021; Takahashi and Tanaka-Ishii, 2019).…”
Section: Distributional Comparisons
Confidence: 99%
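A minimal sketch of the kind of higher-order comparison this statement describes: measure the divergence between the bigram distributions of text sampled from the model and of its training data. The Jensen-Shannon distance and the bigram_js_distance helper are assumptions chosen for illustration, not the cited work's exact methodology.

```python
# Compare a higher-order statistic (bigram distribution) of sampled text
# against training text. Jensen-Shannon distance is one reasonable choice
# of divergence; the cited work's exact statistics may differ.
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def bigram_js_distance(train_tokens, sample_tokens):
    """Jensen-Shannon distance between two empirical bigram distributions."""
    train_counts = Counter(zip(train_tokens, train_tokens[1:]))
    sample_counts = Counter(zip(sample_tokens, sample_tokens[1:]))
    # Align both distributions over the union of observed bigrams.
    vocab = sorted(set(train_counts) | set(sample_counts))
    p = np.array([train_counts[b] for b in vocab], dtype=float)
    q = np.array([sample_counts[b] for b in vocab], dtype=float)
    # 0 means identical distributions; larger values mean poorer calibration.
    return jensenshannon(p / p.sum(), q / q.sum())
```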
“…ing frameworks (Meister & Cotterell, 2021) to better understand whether the large-scale statistical tendencies of natural language, such as Zipf's law (Zipf, 1949), are captured by LMs. We take a more fine-grained approach, proposing a methodology which draws on instance-level evaluation schemes (Zhong et al., 2021) and the experimental control afforded by artificial corpora (White & Cotterell, 2021; Papadimitriou & Jurafsky, 2020).…”
Section: Related Work
Confidence: 99%
“…Recently, a growing body of work has sought to understand how these language models (LMs) fit the distribution of a language beyond standard measures such as perplexity. Meister & Cotterell (2021), for example, investigated the statistical tendencies of the distribution defined by neural LMs, whereas Kulikov et al. (2021) explored whether they adequately capture the modes of the distribution they attempt to model. At the same time, increased focus has been given to performance on rare or novel events in the data distribution, both for models of natural language (McCoy et al., 2021; Lent et al., 2021; Dudy & Bedrick, 2020; Oren et al., 2019) and neural models more generally (see, for example, Sagawa et al., 2020; D'souza et al., 2021; Blevins & Zettlemoyer, 2020; Czarnowska et al., 2019; Horn & Perona, 2017; Ouyang et al., 2016; Bengio, 2015; Zhu et al., 2014).…”
Section: Introduction
Confidence: 99%