Given the success of Transformer-based models, two directions of study have emerged: interpreting the role of individual attention heads and down-sizing the models for efficiency. Our work straddles these two streams: we analyse the importance of basing pruning strategies on the interpreted role of the attention heads, evaluating this on Transformer and BERT models across multiple NLP tasks. Firstly, we find that a large fraction of the attention heads can be randomly pruned with limited effect on accuracy. Secondly, for Transformers, we find no advantage in pruning attention heads identified as important by existing studies that relate importance to the location of a head. On the BERT model, too, we find no preference for the top or bottom layers, though the latter are reported to have higher importance; however, strategies that avoid pruning the middle layers and consecutive layers perform better. Finally, during fine-tuning, the compensation for pruned attention heads is distributed roughly equally across the un-pruned heads. Our results thus suggest that interpretation of attention heads does not strongly inform pruning.
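As a concrete illustration of the pruning setup this abstract describes, below is a minimal sketch of random head pruning using the prune_heads utility from the HuggingFace transformers library. The model name and the 50% pruning fraction are illustrative assumptions, not values taken from the paper.

    import random
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")
    n_layers = model.config.num_hidden_layers   # 12 for bert-base
    n_heads = model.config.num_attention_heads  # 12 for bert-base

    # Sample half of all (layer, head) pairs uniformly at random
    # (the 50% fraction is an illustrative choice).
    random.seed(0)
    all_heads = [(l, h) for l in range(n_layers) for h in range(n_heads)]
    sampled = random.sample(all_heads, k=len(all_heads) // 2)

    # Group the sampled heads by layer, the format prune_heads expects.
    heads_to_prune = {}
    for layer, head in sampled:
        heads_to_prune.setdefault(layer, []).append(head)

    # Physically removes the selected heads' projection parameters; the
    # smaller model can then be fine-tuned and evaluated on the task.
    model.prune_heads(heads_to_prune)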
Introduction
Hispanic Americans receive disproportionately fewer organ transplants than non-Hispanic whites. In 2018, the Hispanic Kidney Transplant Program (HKTP) was established at the University of Colorado Hospital (UCH). The purpose of this quality improvement study was to examine the effect of this culturally sensitive program in reducing disparities in kidney transplantation.
Methods
We performed a mixed-methods analysis of data from 436 Spanish-speaking patients referred for transplant to UCH between 2015 and 2020. We compared outcomes for patients referred between 2015–2017 (n = 156) to those referred between 2018–2020 (n = 280). Semi-structured phone interviews were conducted with 6 patients per time period and with 6 nephrology providers in the Denver Metro Area. Patients and providers were asked to evaluate communication, transplant education, and overall experience.
Results
When comparing the two time periods, there was a significant increase in the percentage of patients referred (79.5% increase, p = 0.008) and evaluated for transplant (82.4% increase, p = 0.02) during 2018–2020. While the number of committee reviews and the number of patients waitlisted also increased during 2018–2020, these increases did not reach statistical significance (82.9% increase, p = 0.37 and 79.5% increase, p = 0.75, respectively). From the patient and provider interviews, we identified 4 themes reflecting participation in the HKTP: improved communication, enhanced patient education, improved experience, and areas for advancement. Overall, patients and providers reported a positive experience with the HKTP and noted improved patient understanding of the transplantation process.
Conclusions
The establishment of the HKTP is associated with a significant increase in Spanish-speaking Hispanic patients being referred and evaluated for kidney transplantation.
Large multilingual models, such as mBERT, have shown promise in crosslingual transfer. In this work, we employ pruning to quantify the robustness and interpret the layer-wise importance of mBERT. On four GLUE tasks, the relative drops in accuracy due to pruning are almost identical for mBERT and BERT, suggesting that the reduced attention capacity of the multilingual model does not affect its robustness to pruning. For the crosslingual task XNLI, we report higher drops in accuracy with pruning, indicating lower robustness in crosslingual transfer. Moreover, the importance of the encoder layers depends sensitively on the language family and the pre-training corpus size. The top layers, which are relatively more influenced by fine-tuning, encode important information for languages similar to English (SVO), while the bottom layers, which are relatively less influenced by fine-tuning, are particularly important for agglutinative and low-resource languages.
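One way to probe layer-wise importance of the kind this abstract reports is to ablate all heads in one encoder layer at a time and measure the resulting drop in task accuracy. The sketch below assumes the HuggingFace transformers library; evaluate_fn is a hypothetical caller-supplied routine that fine-tunes and scores the model on a downstream task, and is not part of any library.

    from transformers import BertModel

    def layerwise_accuracy_drop(model_name, evaluate_fn):
        # Baseline accuracy of the unpruned model.
        base = BertModel.from_pretrained(model_name)
        base_acc = evaluate_fn(base)
        drops = {}
        for layer in range(base.config.num_hidden_layers):
            # Reload a fresh copy and remove every head in this one layer.
            model = BertModel.from_pretrained(model_name)
            model.prune_heads({layer: list(range(model.config.num_attention_heads))})
            drops[layer] = base_acc - evaluate_fn(model)
        return drops

    # e.g. layerwise_accuracy_drop("bert-base-multilingual-cased", my_xnli_eval)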
Multi-headed attention is a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens that have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence), and delimiter (the special [CLS] and [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores across studies or across functional roles, and (b) these scores are often average quantities measured across sentences without capturing statistical significance. In this work, we formalize a simple yet effective score that generalizes to all the roles of attention heads and employ hypothesis testing on this score for robust inference. This provides us the right lens to systematically analyze attention heads and confidently comment on many commonly posed questions about the BERT model. In particular, we comment on the co-location of multiple functional roles in the same attention head, the distribution of attention heads across layers, and the effect of fine-tuning for specific NLP tasks on these functional roles. The code is made publicly available at https://github.com/iitmnlp/heads-hypothesis
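To make the scoring-plus-hypothesis-testing recipe concrete, here is a minimal sketch for one role, the "local" role (attention to nearby tokens). The uniform-attention null, the window size, and the paired one-sided t-test are illustrative assumptions; the paper's exact score and test may differ (see the linked repository).

    import numpy as np
    from scipy import stats

    def local_score(attn, window=1):
        # attn: (seq_len, seq_len) row-stochastic attention matrix for one
        # head on one sentence. Returns the fraction of total attention mass
        # falling within `window` tokens of each query position.
        n = attn.shape[0]
        dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        return (attn * (dist <= window)).sum() / attn.sum()

    def is_local_head(per_sentence_attns, window=1, alpha=0.05):
        # Paired, one-sided t-test: does the head score higher on the local
        # criterion than uniform attention would on the same sentences?
        scores, null_scores = [], []
        for attn in per_sentence_attns:
            n = attn.shape[0]
            scores.append(local_score(attn, window))
            null_scores.append(local_score(np.full((n, n), 1.0 / n), window))
        t, p = stats.ttest_rel(scores, null_scores)
        return t > 0 and p / 2 < alpha  # halve p for the one-sided test

Testing against a per-sentence null in this paired fashion is what lets the decision capture statistical significance rather than relying on a raw average score across sentences.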