2021
DOI: 10.1609/aaai.v35i15.17605
The Heads Hypothesis: A Unifying Statistical Approach Towards Understanding Multi-Headed Attention in BERT

Abstract: Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS], [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores acr…
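The abstract describes scoring each head against role definitions derived from its attention pattern. As a rough illustration of that idea (not the statistical method proposed in the paper), the sketch below computes, for a single head's attention matrix, the fraction of attention mass that falls on delimiter tokens, on nearby tokens, and on same-sentence tokens, and reports the highest-scoring role. The function name, the window size, and the simple mass-fraction scoring are assumptions introduced here for illustration only.

```python
import numpy as np

def head_role_scores(attn, tokens, sentence_ids, window=2):
    """Illustrative sketch: score one head's (seq_len x seq_len) attention
    matrix against the 'delimiter', 'local' and 'block' role definitions.
    All thresholds and definitions here are assumptions, not the paper's tests."""
    seq_len = attn.shape[0]
    idx = np.arange(seq_len)
    total = attn.sum()

    # Delimiter: attention mass placed on [CLS]/[SEP] positions.
    delim_mask = np.array([t in ("[CLS]", "[SEP]") for t in tokens])
    delimiter = attn[:, delim_mask].sum() / total

    # Local: attention mass placed on tokens within a small window of the query.
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= window
    local = attn[local_mask].sum() / total

    # Block: attention mass placed on tokens from the same sentence as the query.
    sent = np.asarray(sentence_ids)
    block_mask = sent[:, None] == sent[None, :]
    block = attn[block_mask].sum() / total

    return {"delimiter": float(delimiter), "local": float(local), "block": float(block)}


# Toy example: a random row-normalised attention matrix over 6 tokens.
rng = np.random.default_rng(0)
attn = rng.random((6, 6))
attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax output
tokens = ["[CLS]", "the", "cat", "[SEP]", "sat", "[SEP]"]
scores = head_role_scores(attn, tokens, sentence_ids=[0, 0, 0, 0, 1, 1])
print(max(scores, key=scores.get), scores)
```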

Cited by 6 publications (3 citation statements)
References 24 publications (48 reference statements)
“…Our findings on the heads' roles align with several related studies. The results on the CoLA-style and BLiMP benchmarks indicate that (i) a single head can perform multiple linguistic functions (Pande et al., 2021), (ii) some linguistic phenomena, e.g., phrasal movement and island effects, are better captured by head ensembles rather than one head (Htut et al., 2019), and (iii) heads within the same or nearby layers extract similar grammatical phenomena (Bian et al., 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…Prior work has demonstrated that heads induce grammar formalisms and structural knowledge (Zhou and Zhao, 2019; Luo, 2021), and that linguistic features motivate attention patterns (Kovaleva et al., 2019; Clark et al., 2019). Recent studies also show that certain heads can have multiple functional roles (Pande et al., 2021) and even perform syntactic functions for typologically distant languages (Ravishankar et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…In essence, based on the results presented in Tables 2, 3 and 4, we conclude that using information from the attention mechanism is helpful in creating augmented samples. At the same time, a detailed… The existence of unintuitive attention heads has already been observed in the literature [4,13]. The first paper, among other topics, studies the so-called vertical attention heads that attend mostly to dots, commas and BERT special tokens.…”
Section: Ablation Study (mentioning)
confidence: 90%