Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1337

Scalable Syntax-Aware Language Models Using Knowledge Distillation

Abstract: Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders scaling difficult, and it remains an open question whether structural biases are still necessary when sequential models have access to ever larger amounts of training data. To answer this question, we introduce an efficient knowledge distillation (KD) technique that transfers…
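As a rough illustration of the distillation objective sketched in the abstract, the snippet below shows a minimal word-level KD loss in PyTorch, assuming a teacher that exposes next-word logits over the same vocabulary as the student; the mixing weight alpha, the tensor shapes, and the padding handling are illustrative assumptions, not the authors' exact training setup.

```python
# Minimal sketch of word-level knowledge distillation for language modelling.
# alpha, pad_id and the tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, pad_id=0):
    """student_logits, teacher_logits: (batch, seq_len, vocab); targets: (batch, seq_len)."""
    vocab = student_logits.size(-1)
    # Soft targets: KL divergence between the teacher's and the student's
    # next-word distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits.reshape(-1, vocab), dim=-1),
        F.softmax(teacher_logits.reshape(-1, vocab), dim=-1),
        reduction="batchmean",
    )
    # Hard targets: ordinary cross-entropy against the observed next words.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab), targets.reshape(-1), ignore_index=pad_id
    )
    return alpha * kd + (1.0 - alpha) * ce
```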

Cited by 28 publications (43 citation statements). References 42 publications (55 reference statements).
“…Another relevant work on the capacity of LSTM-LMs is Kuncoro et al. (2019), which shows that by distilling from syntactic LMs (Dyer et al., 2016), LSTM-LMs can improve their robustness on various agreement phenomena. We show that our LMs with the margin loss outperform theirs in most of the aspects, further strengthening the argument about a stronger capacity of LSTM-LMs.…”
Section: Past Work Conceptually Similar to Us (citation type: mentioning, confidence: 99%)
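For context on the "margin loss" mentioned in the excerpt above, the following is a generic hinge-style margin over the log-probabilities of a grammatical versus an ungrammatical continuation (e.g. singular vs. plural verb form); the exact formulation in the citing paper may differ, and the identifiers are placeholders.

```python
# Generic hinge-style margin loss over a correct vs. incorrect word at the
# target position; an illustrative sketch, not the citing paper's exact loss.
import torch

def margin_loss(log_probs: torch.Tensor, correct_id: int, wrong_id: int,
                margin: float = 1.0) -> torch.Tensor:
    """log_probs: (vocab,) log-probabilities at the position of the target word."""
    diff = log_probs[correct_id] - log_probs[wrong_id]
    # Penalise the model unless the correct form beats the wrong form by `margin`.
    return torch.clamp(margin - diff, min=0.0)
```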
“…Training data: Following the practice, we train LMs on a dataset not directly relevant to the test set. Throughout the paper, we use an English Wikipedia corpus assembled by Gulordava et al. (2018), which has been used as training data for the present task (Marvin and Linzen, 2018; Kuncoro et al., 2019), consisting of 80M/10M/10M tokens for the training/dev/test sets. It is tokenized and rare words are replaced by a single unknown token, amounting to a vocabulary size of 50,000.…”
Section: Language Models (citation type: mentioning, confidence: 99%)
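The preprocessing described in this excerpt (tokenised Wikipedia text, rare words mapped to a single unknown token, a 50,000-word vocabulary) can be sketched roughly as below; the <unk> symbol and function names are placeholders, not the released preprocessing script.

```python
# Sketch of vocabulary truncation: keep the 50,000 most frequent tokens and
# map everything else to a single unknown token. Names are placeholders.
from collections import Counter

def build_vocab(tokenised_lines, size=50_000, unk="<unk>"):
    counts = Counter(tok for line in tokenised_lines for tok in line.split())
    keep = {tok for tok, _ in counts.most_common(size - 1)}  # reserve one slot for <unk>
    return keep, unk

def apply_vocab(line, keep, unk="<unk>"):
    return " ".join(tok if tok in keep else unk for tok in line.split())
```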
“…Our work is also closely related to Kuncoro et al. (2019), who distill syntactic structure knowledge into a student LSTM model. The difference lies in that they focus on transferring tree knowledge from a syntax-aware language model for achieving scalable unsupervised syntax induction, while we aim at integrating heterogeneous syntax for improving downstream tasks.…”
Section: Knowledge Distillation (citation type: mentioning, confidence: 98%)
“…Sequential models have been proven effective at encoding syntactic tree information (Shen et al., 2018; Kuncoro et al., 2019). We set the goal of KD as simultaneously distilling heterogeneous structures from tree encoder teachers into an LSTM student model.…”
Section: Heterogeneous Structure Distillation (citation type: mentioning, confidence: 99%)
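As a rough sketch of the multi-teacher setting described in this excerpt, the snippet below sums per-teacher KL terms against the student's next-word distribution; the uniform teacher weighting and function signature are assumptions, not the citing paper's exact objective.

```python
# Sketch of distilling several structure-aware teachers into one student LM
# by combining per-teacher KL terms; weighting scheme is an assumption.
import torch
import torch.nn.functional as F

def multi_teacher_kd(student_logits, teacher_logits_list, weights=None):
    """student_logits: (batch, seq_len, vocab); each teacher has the same shape."""
    vocab = student_logits.size(-1)
    log_q = F.log_softmax(student_logits.reshape(-1, vocab), dim=-1)
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p = F.softmax(t_logits.reshape(-1, vocab), dim=-1)
        # Weighted KL from each teacher's distribution to the student's.
        loss = loss + w * F.kl_div(log_q, p, reduction="batchmean")
    return loss
```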