Anna Rumshisky scite author profile

Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.

show abstract

Revealing the Dark Secrets of BERT

Kovaleva¹,

Romanov²,

Rogers³

et al. 2019

389

367

View full text Add to dashboard Cite

BERT-based architectures currently give stateof-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to its success. In the current work, we focus on the interpretation of selfattention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose the methodology and carry out a qualitative and quantitative analysis of the information encoded by the individual BERT's heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating the overall model overparametrization. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.

show abstract

Evaluating temporal relations in clinical text: 2012 i2b2 Challenge

Sun¹,

Rumshisky²,

Uzuner³

2013

J Am Med Inform Assoc

369

329

View full text Add to dashboard Cite

For event detection, statistical machine learning (ML) methods consistently showed superior performance. While ML and rule based methods seemed to detect temporal expressions equally well, the best systems overwhelmingly adopted a rule based approach for value normalization. For temporal relation classification, the systems using hybrid approaches that combined ML and heuristics based methods produced the best results.

show abstract

Predicting early psychiatric readmission with natural language processing of narrative discharge summaries

Rumshisky¹,

Ghassemi²,

Naumann³

et al. 2016

Transl Psychiatry

140

121

View full text Add to dashboard Cite

The ability to predict psychiatric readmission would facilitate the development of interventions to reduce this risk, a major driver of psychiatric health-care costs. The symptoms or characteristics of illness course necessary to develop reliable predictors are not available in coded billing data, but may be present in narrative electronic health record (EHR) discharge summaries. We identified a cohort of individuals admitted to a psychiatric inpatient unit between 1994 and 2012 with a principal diagnosis of major depressive disorder, and extracted inpatient psychiatric discharge narrative notes. Using these data, we trained a 75-topic Latent Dirichlet Allocation (LDA) model, a form of natural language processing, which identifies groups of words associated with topics discussed in a document collection. The cohort was randomly split to derive a training (70%) and testing (30%) data set, and we trained separate support vector machine models for baseline clinical features alone, baseline features plus common individual words and the above plus topics identified from the 75-topic LDA model. Of 4687 patients with inpatient discharge summaries, 470 were readmitted within 30 days. The 75-topic LDA model included topics linked to psychiatric symptoms (suicide, severe depression, anxiety, trauma, eating/weight and panic) and major depressive disorder comorbidities (infection, postpartum, brain tumor, diarrhea and pulmonary disease). By including LDA topics, prediction of readmission, as measured by area under receiver-operating characteristic curves in the testing data set, was improved from baseline (area under the curve 0.618) to baseline+1000 words (0.682) to baseline+75 topics (0.784). Inclusion of topics derived from narrative notes allows more accurate discrimination of individuals at high risk for psychiatric readmission in this cohort. Topic modeling and related approaches offer the potential to improve prediction using EHRs, if generalizability can be established in other clinical cohorts.

show abstract

When BERT Plays the Lottery, All Tickets Are Winning

Prasanna¹,

Rogers²,

Rumshisky³

2020

View full text Add to dashboard Cite

Large Transformer-based models were shown to be reducible to a smaller number of selfattention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.

show abstract

SemEval-2017 Task 6: #HashtagWars: Learning a Sense of Humor

Potash¹,

Romanov²,

Rumshisky³

2017

View full text Add to dashboard Cite

This paper describes a new shared task for humor understanding that attempts to eschew the ubiquitous binary approach to humor detection and focus on comparative humor ranking instead. The task is based on a new dataset of funny tweets posted in response to shared hashtags, collected from the 'Hashtag Wars' segment of the TV show @midnight. The results are evaluated in two subtasks that require the participants to generate either the correct pairwise comparisons of tweets (subtask A), or the correct ranking of the tweets (subtask B) in terms of how funny they are. 7 teams participated in subtask A, and 5 teams participated in subtask B. The best accuracy in subtask A was 0.675. The best (lowest) rank edit distance for subtask B was 0.872.

show abstract

Automating temporal annotation with TARSQI

et al. 2005

View full text Add to dashboard Cite

show abstract

A Primer in BERTology: What we know about how BERT works

Rogers¹,

Ковалева²,

Rumshisky³

2020

Preprint

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Anna Rumshisky

A Primer in BERTology: What We Know About How BERT Works

Revealing the Dark Secrets of BERT

Evaluating temporal relations in clinical text: 2012 i2b2 Challenge

Predicting early psychiatric readmission with natural language processing of narrative discharge summaries

When BERT Plays the Lottery, All Tickets Are Winning

SemEval-2017 Task 6: #HashtagWars: Learning a Sense of Humor

Automating temporal annotation with TARSQI

A Primer in BERTology: What we know about how BERT works

Contact Info

Product

Resources

About