Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.448
Consistency of a Recurrent Language Model With Respect to Incomplete Decoding

Abstract: Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition. We study the related issue of receiving infinite-length sequences from a recurrent language model when using common decoding algorithms. To analyze this issue, we first define inconsistency of a decoding algorithm, meaning that the algorithm can yield an infinite-length sequence that has zero probability under the model. We pro…
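The abstract's notion of inconsistency can be illustrated with a toy model (hypothetical, not from the paper): an autoregressive model that assigns probability 0.4 to EOS and 0.6 to a single token "a" at every step. Such a model terminates with probability 1 under ancestral sampling, yet greedy decoding always picks the argmax "a" and never emits EOS, producing an infinite sequence whose probability is lim 0.6^n = 0 — exactly the kind of inconsistent decoding the paper defines.

```python
import random

# Toy autoregressive model (hypothetical, for illustration only):
# at every step, p("a") = 0.6 and p(<eos>) = 0.4, independent of context.
P = {"a": 0.6, "<eos>": 0.4}

def greedy_decode(max_steps=20):
    """Greedy decoding always picks the argmax token "a", so it never
    emits <eos>: unbounded, it yields an infinite sequence with
    probability 0.6 ** inf = 0 under the model (inconsistency)."""
    out = []
    for _ in range(max_steps):
        tok = max(P, key=P.get)  # argmax is always "a"
        if tok == "<eos>":
            break
        out.append(tok)
    return out

def ancestral_sample(rng):
    """Ancestral sampling terminates with probability 1, since each
    step ends the sequence with probability 0.4."""
    out = []
    while True:
        tok = rng.choices(list(P), weights=list(P.values()))[0]
        if tok == "<eos>":
            return out
        out.append(tok)

rng = random.Random(0)
print(len(greedy_decode()))        # always hits the step cap
print(len(ancestral_sample(rng)))  # finite with probability 1
```

The sketch only mimics the argmax-vs-sampling contrast; the paper's analysis concerns trained recurrent language models, where the same gap arises from context-dependent conditionals.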

Cited by 29 publications (27 citation statements). References 18 publications (40 reference statements).
“…Overall, this fine-tuning strategy is able to generate explanations that follow a style similar to the reference explanation. However, we identify cases where the model generates gibberish and/or repetitive text, which are problems previously reported in the literature while using GPT-2 (Holtzman et al., 2019; Welleck et al., 2020). To address these issues, we devise a strategy to remove unimportant sentences that could introduce noise to the generation process.…”
Section: Abstractive: GPT-2 Based
Mentioning confidence: 88%
“…In this case, the factors ( |x) defined by the autoregressive model are not actually the conditional probabilities of the weighted language (as defined by §2.1). It is true that training with a likelihood objective does encourage finding a weighted language whose generative process always terminates (hence = 1), since this is the behavior observed in the training corpus (Chi and Geman, 1998; Chen et al., 2018; Welleck et al., 2020). Our definitions of ELN(CP) models require the actual conditional probabilities to be efficiently computable.…”
Section: ELN and ELNCP Models
Mentioning confidence: 99%
“…Concretely, we consider two decoding approaches: a deterministic decoding algorithm that produces a set of sequences using beam search with beam-width k, and a stochastic decoding algorithm that forms a set of sequences using ancestral sampling until k unique sequences are obtained.¹ We refer readers to Welleck et al. (2020a) for detailed descriptions of those decoding algorithms.…”
Section: Neural Autoregressive Sequence Modeling
Mentioning confidence: 99%
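The stochastic decoding approach quoted above — ancestral sampling repeated until k unique sequences are collected — can be sketched as follows. This is a minimal toy sketch, not the cited authors' implementation: `sample_sequence` is a hypothetical stand-in for sampling from a trained language model's conditionals.

```python
import random

def sample_sequence(rng, p_eos=0.3, vocab=("a", "b")):
    """Hypothetical ancestral sampler: at each step, terminate with
    probability p_eos, otherwise emit a uniformly chosen token.
    A real decoder would sample from a trained LM's conditionals."""
    out = []
    while rng.random() >= p_eos:
        out.append(rng.choice(vocab))
    return tuple(out)  # tuples are hashable, so they can go in a set

def sample_k_unique(rng, k):
    """Stochastic decoding: keep ancestral-sampling until k unique
    sequences have been obtained, discarding duplicates."""
    hypotheses = set()
    while len(hypotheses) < k:
        hypotheses.add(sample_sequence(rng))
    return hypotheses

rng = random.Random(0)
hyps = sample_k_unique(rng, k=5)
print(len(hyps))  # 5 distinct sampled sequences
```

Note that this loop terminates only if the model's support contains at least k finite sequences; with an inconsistent model whose samples rarely terminate, collecting k unique finite sequences could stall, which is one practical face of the consistency issue the paper studies.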
“…However, recent studies suggest that the most likely sequences may not resemble training sequences at all. For instance, the learning stage can yield a distribution p_model which places high probability on empty (Stahlberg and Byrne, 2019) or repetitive (Holtzman et al., 2019) sequences, while the decoding stage can yield a distribution p_F which places non-zero mass on infinite-length sequences (Welleck et al., 2020a).…”
Section: Mode Recovery
Mentioning confidence: 99%