Kyle Lo scite author profile

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019), to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Gururangan¹,

Marasović²,

Swayamdipta³

et al. 2020

894

423

Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multiphase adaptive pretraining offers large gains in task performance.

SciBERT: A Pretrained Language Model for Scientific Text

Beltagy¹,

Lo²,

Cohan³

2019

Preprint

176

255

Fact or Fiction: Verifying Scientific Claims

Wadden¹,

Lin²,

Lo³

et al. 2020

166

213

We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SCIFACT, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SCIFACT, and demonstrate that simple domain adaptation techniques substantially improve performance compared to models trained on Wikipedia or political news. We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus. Our experiments indicate that SCIFACT will provide a challenging testbed for the development of new systems designed to retrieve and reason over corpora containing specialized domain knowledge. Data and code for this new task are publicly available at https://github.com/allenai/scifact. A leaderboard and COVID-19 fact-checking demo are available at https://scifact.apps.allenai.org.

* Work performed during internship with the Allen Institute for Artificial Intelligence.

Example. Claim: Cardiac injury is common in critical cases of COVID-19. Rationale from the corpus: More severe COVID-19 infection is associated with higher mean troponin (SMD 0.53, 95% CI 0.30 to 0.75, p < 0.001). Decision: SUPPORTS.

Claim 1: Lopinavir/ritonavir have exhibited favorable clinical responses when used as a treatment for coronavirus.

Supports: ... Interestingly, after lopinavir/ritonavir (Kaletra, AbbVie) was administered, β-coronavirus viral loads significantly decreased and no or little coronavirus titers were observed.

Refutes: The focused drug repurposing of known approved drugs (such as lopinavir/ritonavir) has been reported failed for curing SARS-CoV-2 infected patients. It is urgent to generate new chemical entities against this virus ...

Claim 2: The coronavirus cannot thrive in warmer climates.

Supports: ... most outbreaks display a pattern of clustering in relatively cool and dry areas ... This is because the environment can mediate human-to-human transmission of SARS-CoV-2, and unsuitable climates can cause the virus to destabilize quickly ...

Refutes: ... significant cases in the coming months are likely to occur in more humid (warmer) climates, irrespective of the climate-dependence of transmission and that summer temperatures will not substantially limit pandemic growth.

Construction of the Literature Graph in Semantic Scholar

Ammar¹,

Groeneveld²,

Bhagavatula³

et al. 2018

259

204

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.

S2ORC: The Semantic Scholar Open Research Corpus

Lo¹,

Wang²,

Neumann³

et al. 2020

229

187

We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks

Gururangan¹,

Marasović²,

Swayamdipta³

et al. 2020

Preprint

141

TLDR: Extreme Summarization of Scientific Documents

Cachola¹,

Lo²,

Cohan³

et al. 2020

We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden. We propose CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal. CATTS improves upon strong baselines under both automated metrics and human evaluations. Data and code are publicly available at https://github.com/allenai/scitldr.
