Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1592

Quantity doesn’t buy quality syntax with neural language models

Abstract: Recurrent neural networks can learn to predict upcoming words remarkably well on average; in syntactically complex contexts, however, they often assign unexpectedly high probabilities to ungrammatical words. We investigate to what extent these shortcomings can be mitigated by increasing the size of the network and the corpus on which it is trained. We find that gains from increasing network size are minimal beyond a certain point. Likewise, expanding the training corpus yields diminishing returns; we estimate …
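
The failure mode the abstract describes (an LM preferring an ungrammatical word in a complex context) is typically probed with minimal pairs. A minimal sketch of such a probe follows; it uses GPT-2 via Hugging Face transformers purely as a stand-in, since the paper itself trains and evaluates recurrent (LSTM) language models, and the sentence, word choices, and helper name are illustrative assumptions.

```python
# Illustrative sketch only: probing an LM with a minimal pair. The paper
# evaluates LSTM LMs it trains itself; GPT-2 and this sentence are
# assumptions chosen so the example runs out of the box.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Agreement across an "attractor" noun: the head noun "keys" is plural,
# so "are" is grammatical even though "cabinet" (singular) is closer.
prefix = "The keys to the cabinet"

def next_word_prob(prefix: str, word: str) -> float:
    """Probability the model assigns to `word` as the next word."""
    ids = tokenizer.encode(prefix, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    word_ids = tokenizer.encode(" " + word)    # leading space = word boundary
    assert len(word_ids) == 1, "choose words that are single BPE tokens"
    return probs[word_ids[0]].item()

p_good = next_word_prob(prefix, "are")  # grammatical
p_bad = next_word_prob(prefix, "is")    # ungrammatical
print(f"P(are)={p_good:.2e}  P(is)={p_bad:.2e}  correct={p_good > p_bad}")
```

A full evaluation repeats this comparison over many minimal pairs and reports the fraction in which the grammatical form receives the higher probability.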

Citation types: 4 supporting, 53 mentioning, 0 contrasting
Cited by 60 publications (57 citation statements)
References 22 publications (32 reference statements)
“…Our results address the three questions posed above: First, for the range of model architectures and dataset sizes tested, we find a substantial dissociation between perplexity and SG score. Second, we find a larger effect of model inductive bias than training data size on SG score, a result that accords with van Schijndel et al. (2019). Models afforded explicit structural supervision during training outperform other models: One structurally supervised model is able to achieve the same SG scores as a purely sequence-based model trained on ∼100 times the number of tokens.…”
Section: Introduction (supporting)
confidence: 76%
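
The excerpt contrasts perplexity with syntactic generalization (SG) scores. As a rough illustration of why the two can dissociate, the sketch below computes both for one model: perplexity averages prediction quality over every token, while a targeted score depends only on the positions that decide grammaticality. GPT-2, the minimal pairs, and the helper names are assumptions for demonstration, not the cited papers' protocol.

```python
# Illustrative sketch (assumed setup, not the cited papers' protocol):
# a model can improve average prediction (perplexity) without improving
# on the targeted contrasts that an SG-style score measures.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Corpus-style metric: exp of mean token negative log-likelihood."""
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # transformers shifts labels internally
    return torch.exp(loss).item()

def targeted_score(pairs) -> float:
    """Targeted metric: fraction of minimal pairs where the grammatical
    sentence is assigned higher probability (lower perplexity) than its foil."""
    return sum(perplexity(good) < perplexity(bad) for good, bad in pairs) / len(pairs)

# Hypothetical minimal pairs (grammatical, ungrammatical).
pairs = [
    ("The keys to the cabinet are on the table.",
     "The keys to the cabinet is on the table."),
    ("The author that the critics praise writes well.",
     "The author that the critics praise write well."),
]
print("targeted score:", targeted_score(pairs))
```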
“…These high performance levels typically come at the cost of decreased interpretability. Such neural nets are notoriously prone to learning irrelevant correlations (Ettinger, 2020; Futrell et al., 2019; Kuncoro et al., 2018; van Schijndel, Mueller, & Linzen, 2019). To avoid this problem and focus our investigation more squarely on structural constraints like locality in Grodner and Gibson (2005) and non-structural factors such as animacy in Traxler et al. (2002), we instead proceed with an explicit grammar whose generalization ability rests upon well-chosen syntactic analyses.…”
Section: From Grammar to Processing Difficulty Predictions (mentioning)
confidence: 99%
“…The Transformer allows the attention for a token to be spread over the entire input sequence, multiple times, intuitively capturing different properties. This characteristic has led to a line of research focusing on the interpretation of Transformer-based networks and their attention mechanisms (Raganato and Tiedemann, 2018; Tang et al., 2018; Mareček and Rosa, 2019; Voita et al., 2019a; Vig and Belinkov, 2019; Clark et al., 2019; Kovaleva et al., 2019; Tenney et al., 2019; Lin et al., 2019; Jawahar et al., 2019; van Schijndel et al., 2019; Hao et al., 2019b; Rogers et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
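
The attention maps this interpretation literature studies are directly accessible in common toolkits. A minimal sketch, assuming GPT-2 and Hugging Face transformers rather than any model from the cited works:

```python
# Sketch (assumed setup): extracting per-layer, per-head attention maps,
# the kind of object the interpretation literature cited above analyzes.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(ids, output_attentions=True)

# out.attentions: one tensor per layer, each of shape
# (batch, n_heads, seq_len, seq_len); every row is a distribution over
# earlier positions, since GPT-2's attention is causally masked.
print(len(out.attentions), out.attentions[0].shape)
```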