Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
DOI: 10.18653/v1/w18-5444
Extracting Syntactic Trees from Transformer Encoder Self-Attentions

Cited by 34 publications (32 citation statements)
References 7 publications
“…The representations generated by NMT models may be thought of as contextualized word representations (CWRs), as they capture context via the NMT encoder or decoder. We have already mentioned one work exploiting this idea, known as CoVe (McCann et al. 2017), which used NMT representations as features in other models to perform various NLP tasks. Other prominent contextualizers include ELMo (Peters et al. 2018a), which trains two separate, forward and backward LSTM language models (with a character CNN building block) and concatenates their representations across several layers; GPT (Radford et al. 2018) and GPT-2 (Radford et al. 2019), which use transformer language models based on self-attention (Vaswani et al. 2017); and BERT (Devlin et al. 2019), which uses a bidirectional transformer model trained on masked language modeling (filling in the blanks).…”
Section: Contextualized Word Representations
confidence: 99%
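To make the ELMo-style construction described above concrete, here is a minimal PyTorch sketch: separate forward and backward LSTMs run over an embedded toy sequence, and their hidden states are concatenated into one contextualized vector per token. All sizes and the random input are illustrative assumptions, not the cited models' actual configurations (ELMo's character-CNN input layer and multi-layer weighting are omitted).

```python
import torch
import torch.nn as nn

# Toy dimensions; ELMo's real sizes and character-CNN input are omitted.
vocab_size, emb_dim, hidden_dim, seq_len = 100, 16, 32, 5

embed = nn.Embedding(vocab_size, emb_dim)
fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # reads left-to-right
bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # reads right-to-left

tokens = torch.randint(0, vocab_size, (1, seq_len))        # (batch, seq)
x = embed(tokens)

fwd_out, _ = fwd_lstm(x)                                   # (1, seq, hidden)
bwd_out, _ = bwd_lstm(torch.flip(x, dims=[1]))             # run on reversed sequence
bwd_out = torch.flip(bwd_out, dims=[1])                    # re-align to original order

# ELMo-style CWR for each token: concatenation of both directions' states.
cwr = torch.cat([fwd_out, bwd_out], dim=-1)                # (1, seq, 2 * hidden)
print(cwr.shape)  # torch.Size([1, 5, 64])
```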
“…The generalization of the particular results in this work to other architectures remains a question for future study. Recent efforts to analyze transformer-based NMT models include attempts to extract syntactic trees from self-attention weights (Mareček and Rosa 2018; Raganato and Tiedemann 2018) and to evaluate representations from the transformer encoder (Raganato and Tiedemann 2018). The latter found that lower layers tend to focus on POS and shallow syntax, whereas higher layers are more focused on semantic tagging.…”
Section: Other NMT Architectures
confidence: 99%
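As a rough illustration of the idea behind extracting syntactic trees from self-attention weights, the sketch below applies a simple greedy head-selection rule to a stand-in attention matrix (random here; in a real analysis it would come from one encoder head). This is only a minimal sketch of the general recipe, not the exact procedure of the cited works, which use more careful decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "mat"]
n = len(tokens)

# Stand-in for one head's self-attention matrix: row i is a distribution
# over the positions token i attends to.
attn = rng.random((n, n))
attn = attn / attn.sum(axis=1, keepdims=True)

root = 2  # assume a chosen root token (e.g., the main verb)
heads = []
for i in range(n):
    if i == root:
        heads.append(-1)                      # the root has no head
        continue
    scores = attn[i].copy()
    scores[i] = -np.inf                       # a token cannot be its own head
    heads.append(int(np.argmax(scores)))      # greedy: most-attended token is the head

for i, h in enumerate(heads):
    print(f"{tokens[i]} <- {'ROOT' if h == -1 else tokens[h]}")
```

Greedy choices like this can produce cycles and non-trees, which is one reason published approaches decode a maximum spanning tree over the attention scores instead.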
“…The original Transformer paper (Vaswani et al., 2017) shows attention visualizations, from which some speculation can be made about the roles the several attention heads have. Mareček and Rosa (2018) study the syntactic abilities of the Transformer self-attention, while Raganato and Tiedemann (2018) extract dependency relations from the attention weights. Tenney et al. (2019) find that the self-attentions in BERT (Devlin et al., 2019) follow a sequence of processes that resembles a classical NLP pipeline.…”
Section: Related Work
confidence: 99%
“…lie multi-head attention mechanisms: each word is represented by multiple different weighted averages of its relevant context. As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens (Tang et al., 2018; Raganato and Tiedemann, 2018; Mareček and Rosa, 2018; Tenney et al., 2019; Voita et al., 2019). The attention distribution of each head is typically predicted using the softmax normalizing transform.…”
Section: Introduction
confidence: 99%
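A minimal NumPy sketch of the mechanism this excerpt describes (dimensions and inputs are toy assumptions): each head computes softmax-normalized scaled dot-product attention, so every token's output is a different weighted average of its context's value vectors, and the heads' outputs are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = rng.standard_normal((seq_len, d_model))      # toy token representations
# One random projection per head for queries, keys, and values.
Wq, Wk, Wv = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))

head_outputs = []
for h in range(n_heads):
    q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]    # (seq, d_head) each
    scores = q @ k.T / np.sqrt(d_head)           # scaled dot products
    attn = softmax(scores, axis=-1)              # each row sums to 1
    head_outputs.append(attn @ v)                # weighted average of values

out = np.concatenate(head_outputs, axis=-1)      # (seq, d_model)
print(out.shape)  # (4, 8)
```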
“…Transformers (Devlin et al., 2018; Liu et al., 2019) seem to capture syntactic relations among words by "focusing the attention". Yet, to be sure that syntax is encoded, many syntactic probes (Conneau et al., 2018) for neural networks have been designed to test for specific phenomena (Kovaleva et al., 2019; Jawahar et al., 2019; Hewitt and Manning, 2019; Ettinger, 2019; Goldberg, 2019) or for full syntactic trees (Hewitt and Manning, 2019; Mareček and Rosa, 2019). Indeed, some syntax is correctly encoded in these universal sentence embeddings.…”
Section: Introduction
confidence: 99%
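To illustrate the probing methodology these works share, here is a minimal, assumption-laden sketch: a frozen model's per-token representations (faked here as class-dependent random vectors) are fed to a small linear classifier trained to predict a syntactic label such as a coarse POS tag. Above-chance held-out accuracy is then read, with the usual caveats, as evidence that the property is linearly encoded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tokens, dim, n_tags = 500, 32, 3   # toy sizes; real probes use model hidden states

# Stand-in for frozen contextualized representations: class-dependent random
# vectors, so a "syntactic" signal is recoverable by a linear model.
tags = rng.integers(0, n_tags, size=n_tokens)          # e.g., coarse POS tags
class_means = rng.standard_normal((n_tags, dim))
reps = class_means[tags] + 0.5 * rng.standard_normal((n_tokens, dim))

# The probe itself: a linear classifier trained on top of frozen representations.
train, test = slice(0, 400), slice(400, 500)
probe = LogisticRegression(max_iter=1000).fit(reps[train], tags[train])
acc = probe.score(reps[test], tags[test])
print(f"probe accuracy: {acc:.2f} (chance is {1 / n_tags:.2f})")
```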