Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
DOI: 10.18653/v1/w18-5444
Extracting Syntactic Trees from Transformer Encoder Self-Attentions

Cited by 34 publications (32 citation statements)
References 7 publications
“…The representations generated by NMT models may be thought of as contextualized word representations (CWRs), as they capture context via the NMT encoder or decoder. We have already mentioned one work exploiting this idea, known as CoVe (McCann et al. 2017), which used NMT representations as features in other models to perform various NLP tasks. Other prominent contextualizers include ELMo (Peters et al. 2018a), which trains two separate, forward and backward LSTM language models (with a character CNN building block) and concatenates their representations across several layers; GPT (Radford et al. 2018) and GPT-2 (Radford et al. 2019), which use transformer language models based on self-attention (Vaswani et al. 2017); and BERT (Devlin et al. 2019), which uses a bidirectional transformer model trained on masked language modeling (filling in the blanks).…”
Section: Contextualized Word Representations
confidence: 99%
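To make the ELMo-style construction described above concrete, here is a minimal PyTorch sketch: separate forward and backward LSTMs run over an embedded toy sequence, and their hidden states are concatenated into one contextualized vector per token. All sizes and the random input are illustrative assumptions, not the cited models' actual configurations (ELMo's character-CNN input layer and multi-layer weighting are omitted).

```python
import torch
import torch.nn as nn

# Toy dimensions; ELMo's real sizes and character-CNN input are omitted.
vocab_size, emb_dim, hidden_dim, seq_len = 100, 16, 32, 5

embed = nn.Embedding(vocab_size, emb_dim)
fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # reads left-to-right
bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # reads right-to-left

tokens = torch.randint(0, vocab_size, (1, seq_len))        # (batch, seq)
x = embed(tokens)

fwd_out, _ = fwd_lstm(x)                                   # (1, seq, hidden)
bwd_out, _ = bwd_lstm(torch.flip(x, dims=[1]))             # run on reversed sequence
bwd_out = torch.flip(bwd_out, dims=[1])                    # re-align to original order

# ELMo-style CWR for each token: concatenation of both directions' states.
cwr = torch.cat([fwd_out, bwd_out], dim=-1)                # (1, seq, 2 * hidden)
print(cwr.shape)  # torch.Size([1, 5, 64])
```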
“…The generalization of the particular results in this work to other architectures remains a question for future study. Recent efforts to analyze transformer-based NMT models include attempts to extract syntactic trees from self-attention weights (Mareček and Rosa 2018; Raganato and Tiedemann 2018) and to evaluate representations from the transformer encoder (Raganato and Tiedemann 2018). The latter found that lower layers tend to focus on POS and shallow syntax, whereas higher layers are more focused on semantic tagging.…”
Section: Other NMT Architectures
confidence: 99%
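As a rough illustration of the idea behind extracting syntactic trees from self-attention weights, the sketch below applies a simple greedy head-selection rule to a stand-in attention matrix (random here; in a real analysis it would come from one encoder head). This is only a minimal sketch of the general recipe, not the exact procedure of the cited works, which use more careful decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "mat"]
n = len(tokens)

# Stand-in for one head's self-attention matrix: row i is a distribution
# over the positions token i attends to.
attn = rng.random((n, n))
attn = attn / attn.sum(axis=1, keepdims=True)

root = 2  # assume a chosen root token (e.g., the main verb)
heads = []
for i in range(n):
    if i == root:
        heads.append(-1)                      # the root has no head
        continue
    scores = attn[i].copy()
    scores[i] = -np.inf                       # a token cannot be its own head
    heads.append(int(np.argmax(scores)))      # greedy: most-attended token is the head

for i, h in enumerate(heads):
    print(f"{tokens[i]} <- {'ROOT' if h == -1 else tokens[h]}")
```

Greedy choices like this can produce cycles and non-trees, which is one reason published approaches decode a maximum spanning tree over the attention scores instead.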
“…The original Transformer paper (Vaswani et al., 2017) shows attention visualizations, from which some speculation can be made about the roles the several attention heads have. Mareček and Rosa (2018) study the syntactic abilities of the Transformer self-attention, while Raganato and Tiedemann (2018) extract dependency relations from the attention weights. Tenney et al. (2019) find that the self-attentions in BERT (Devlin et al., 2019) follow a sequence of processes that resembles a classical NLP pipeline.…”
Section: Related Work
confidence: 99%
“…lie multi-head attention mechanisms: each word is represented by multiple different weighted averages of its relevant context. As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens (Tang et al., 2018; Raganato and Tiedemann, 2018; Mareček and Rosa, 2018; Tenney et al., 2019; Voita et al., 2019). The attention distribution of each head is typically predicted using the softmax normalizing transform.…”
Section: Introduction
confidence: 99%
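A minimal NumPy sketch of the mechanism this excerpt describes (dimensions and inputs are toy assumptions): each head computes softmax-normalized scaled dot-product attention, so every token's output is a different weighted average of its context's value vectors, and the heads' outputs are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = rng.standard_normal((seq_len, d_model))      # toy token representations
# One random projection per head for queries, keys, and values.
Wq, Wk, Wv = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))

head_outputs = []
for h in range(n_heads):
    q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]    # (seq, d_head) each
    scores = q @ k.T / np.sqrt(d_head)           # scaled dot products
    attn = softmax(scores, axis=-1)              # each row sums to 1
    head_outputs.append(attn @ v)                # weighted average of values

out = np.concatenate(head_outputs, axis=-1)      # (seq, d_model)
print(out.shape)  # (4, 8)
```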
“…Transformers (Devlin et al., 2018; Liu et al., 2019) seem to capture syntactic relations among words by "focusing the attention". Yet, to be sure that syntax is encoded, many syntactic probes (Conneau et al., 2018) for neural networks have been designed to test for specific phenomena (Kovaleva et al., 2019; Jawahar et al., 2019; Hewitt and Manning, 2019; Ettinger, 2019; Goldberg, 2019) or for full syntactic trees (Hewitt and Manning, 2019; Mareček and Rosa, 2019). Indeed, some syntax is correctly encoded in these universal sentence embeddings.…”
Section: Introduction
confidence: 99%
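To illustrate the probing methodology these works share, here is a minimal, assumption-laden sketch: a frozen model's per-token representations (faked here as class-dependent random vectors) are fed to a small linear classifier trained to predict a syntactic label such as a coarse POS tag. Above-chance held-out accuracy is then read, with the usual caveats, as evidence that the property is linearly encoded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tokens, dim, n_tags = 500, 32, 3   # toy sizes; real probes use model hidden states

# Stand-in for frozen contextualized representations: class-dependent random
# vectors, so a "syntactic" signal is recoverable by a linear model.
tags = rng.integers(0, n_tags, size=n_tokens)          # e.g., coarse POS tags
class_means = rng.standard_normal((n_tags, dim))
reps = class_means[tags] + 0.5 * rng.standard_normal((n_tokens, dim))

# The probe itself: a linear classifier trained on top of frozen representations.
train, test = slice(0, 400), slice(400, 500)
probe = LogisticRegression(max_iter=1000).fit(reps[train], tags[train])
acc = probe.score(reps[test], tags[test])
print(f"probe accuracy: {acc:.2f} (chance is {1 / n_tags:.2f})")
```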