We describe a deployed, scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes representing papers, authors, and entities, together with various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction to familiar NLP tasks (e.g., entity extraction and linking), point out research challenges arising from differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.
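The abstract above describes a heterogeneous graph with typed nodes (papers, authors, entities) and typed edges (authorships, citations, entity mentions). As a minimal sketch of what such a structure looks like, the following toy class is illustrative only; the names and schema are assumptions, not the deployed system's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str  # "paper", "author", or "entity"

@dataclass
class LiteratureGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, edge_type) triples

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = Node(node_id, node_type)

    def add_edge(self, src, dst, edge_type):
        # e.g. edge_type in {"authored", "cites", "mentions"}
        self.edges.append((src, dst, edge_type))

    def neighbors(self, node_id, edge_type=None):
        # All targets reachable from node_id, optionally filtered by edge type.
        return [d for s, d, t in self.edges
                if s == node_id and (edge_type is None or t == edge_type)]

g = LiteratureGraph()
g.add_node("p1", "paper")
g.add_node("a1", "author")
g.add_node("e1", "entity")
g.add_edge("a1", "p1", "authored")
g.add_edge("p1", "e1", "mentions")
print(g.neighbors("p1", "mentions"))  # -> ['e1']
```

At the scale reported (280M+ nodes), a real system would of course use a distributed store rather than in-memory dictionaries; the sketch only shows the heterogeneous node/edge typing.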
We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SCIFACT, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SCIFACT, and demonstrate that simple domain adaptation techniques substantially improve performance compared to models trained on Wikipedia or political news. We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus. Our experiments indicate that SCIFACT will provide a challenging testbed for the development of new systems designed to retrieve and reason over corpora containing specialized domain knowledge. Data and code for this new task are publicly available at https://github.com/allenai/scifact. A leaderboard and COVID-19 fact-checking demo are available at https://scifact.apps.allenai.org.
[Figure: a fact-checker retrieves evidence from a corpus for the claim "Cardiac injury is common in critical cases of COVID-19" and returns a SUPPORTS decision with a rationale. Further examples show supporting and refuting evidence for the claims "Lopinavir/ritonavir have exhibited favorable clinical responses when used as a treatment for coronavirus" and "The coronavirus cannot thrive in warmer climates."]
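The task described above pairs each claim with an evidence abstract, a SUPPORTS/REFUTES label, and rationale sentences. As a hedged sketch of that record shape, the field names below are assumptions for illustration, not the actual SciFact JSON schema.

```python
# One claim-verification record: a claim, the abstract said to contain
# evidence, a label, and indices of the rationale sentences.
claim_record = {
    "claim": "Cardiac injury is common in critical cases of COVID-19.",
    "evidence": {
        "abstract_id": 42,           # hypothetical corpus identifier
        "label": "SUPPORTS",
        "rationale_sentences": [3],  # sentence indices justifying the label
    },
}

def verdict(record):
    """Return the label only when at least one rationale backs it up."""
    ev = record["evidence"]
    return ev["label"] if ev["rationale_sentences"] else "NOT_ENOUGH_INFO"

print(verdict(claim_record))  # -> SUPPORTS
```

The point of the rationale field is that a system must justify its decision, not merely emit a label; an evaluation can then score both the label and the selected sentences.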
Peer reviewing is a central component of the scientific publishing process. We present the first public dataset of scientific peer reviews available for research purposes (PeerRead v1), providing an opportunity to study this important artifact. The dataset consists of 14.7K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS, and ICLR. The dataset also includes 10.7K textual peer reviews written by experts for a subset of the papers. We describe the data collection process and report interesting observed phenomena in the peer reviews. We also propose two novel NLP tasks based on this dataset and provide simple baseline models. In the first task, we show that simple models can predict whether a paper is accepted with up to 21% error reduction compared to the majority baseline. In the second task, we predict the numerical scores of review aspects and show that simple models can outperform the mean baseline for aspects with high variance such as 'originality' and 'impact'.
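The abstract's second task compares models against a mean baseline: predict the training-set mean score for every test review. A minimal sketch of that baseline follows; the scores are invented numbers, not PeerRead data.

```python
# Mean baseline for aspect-score prediction: predict the mean of the
# training scores for every test example, then measure mean squared error.
train_scores = [3, 4, 2, 5, 4]   # hypothetical 'originality' scores (1-5)
test_scores = [4, 3]

mean_pred = sum(train_scores) / len(train_scores)  # 3.6
mse = sum((s - mean_pred) ** 2 for s in test_scores) / len(test_scores)
print(round(mean_pred, 1), round(mse, 2))  # -> 3.6 0.26
```

A learned model "outperforms the mean baseline" when its MSE on held-out reviews is lower than this number; for low-variance aspects the mean baseline is already hard to beat, which is why the abstract highlights high-variance aspects like 'originality' and 'impact'.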
Identifying the intent of a citation in scientific papers (e.g., background information, use of methods, comparing results) is critical for machine reading of individual publications and automated analysis of the scientific literature. We propose structural scaffolds, a multitask model to incorporate structural information of scientific papers into citations for effective classification of citation intents. Our model achieves a new state-of-the-art on an existing ACL anthology dataset (ACL-ARC) with a 13.3% absolute increase in F1 score, without relying on external linguistic resources or hand-engineered features as done in existing methods. In addition, we introduce a new dataset of citation intents (SciCite) which is more than five times larger than existing datasets and covers multiple scientific domains. Our code and data are available at: https://github.com/allenai/scicite.
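Citation intent classification assigns each citation context one label from a small set; the categories below mirror those named in the abstract, while the example sentences are invented. The sketch computes the majority-class baseline such classifiers are measured against.

```python
from collections import Counter

# (citation context, gold intent) pairs -- invented for illustration.
contexts = [
    ("Prior work studied this problem (Smith, 2010).", "background"),
    ("We use the parser of (Lee, 2017).", "method"),
    ("Our F1 exceeds that of (Kim, 2018).", "result_comparison"),
    ("This area has long been studied (Doe, 2005).", "background"),
]

# Majority-class baseline: always predict the most frequent intent.
majority_label = Counter(label for _, label in contexts).most_common(1)[0][0]
accuracy = sum(label == majority_label for _, label in contexts) / len(contexts)
print(majority_label, accuracy)  # -> background 0.5
```

Real citation-intent datasets are heavily skewed toward "background", so the majority baseline is deceptively strong, which is one reason F1 (rather than accuracy) is the metric the abstract reports.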
Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level, since it requires an understanding of the whole document to annotate entities and their document-level relationships, which usually span beyond sentences or even sections. In this paper, we introduce SCIREX, a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https:
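A document-level N-ary relation, as the abstract describes it, ties together several entities whose mentions may sit in different sections of a paper. The record shape and helper below are a hedged sketch with invented names, not the SCIREX format.

```python
# A 4-ary relation linking entities of four types -- entity values invented.
relation = {
    "dataset": "ImageNet",
    "task": "image classification",
    "method": "ResNet",
    "metric": "top-1 accuracy",
}

def is_document_level(mention_sections):
    # A relation is document-level when its entity mentions span more
    # than one section, so no single sentence or paragraph contains it.
    return len(set(mention_sections.values())) > 1

# Which section each entity of the relation is mentioned in (invented).
sections = {"image classification": "intro", "ResNet": "method",
            "ImageNet": "experiments", "top-1 accuracy": "results"}
print(is_document_level(sections))  # -> True
```

This is exactly what makes annotation expensive: a human must read the whole paper to decide that these four mentions, scattered across sections, belong to one relation.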
Key Points
Question: What is the magnitude of female underrepresentation in clinical studies?
Findings: In this cross-sectional study, machine reading to extract sex data from 43 135 published articles and 13 165 clinical trial records showed substantial underrepresentation of female participants, with studies as the measurement unit, in 7 of 11 disease categories, especially HIV/AIDS, chronic kidney diseases, and cardiovascular diseases. Sex bias in articles for all categories combined was unchanged over time with studies as the measurement unit, but improved with participants as the measurement unit.
Meaning: This study suggests that sex bias against female participants in clinical studies persists, but results differ depending on whether studies or participants are the measurement units.
The COVID-19 pandemic has spawned a diverse body of scientific literature that is challenging to navigate, stimulating interest in automated tools to help find useful knowledge. We pursue the construction of a knowledge base (KB) of mechanisms, a fundamental concept across the sciences which encompasses activities, functions, and causal relations, ranging from cellular processes to economic impacts. We extract this information from the natural language of scientific papers by developing a broad, unified schema that strikes a balance between relevance and breadth. We annotate a dataset of mechanisms with our schema and train a model to extract mechanism relations from papers. Our experiments demonstrate the utility of our KB in supporting interdisciplinary scientific search over COVID-19 literature, outperforming the prominent PubMed search in a study with clinical experts. Our search engine, dataset, and code are publicly available at https://covidmechanisms.apps.allenai.org/.
[Figure: example mechanism relations retrieved from CORD-19 papers for the query (deep learning, drugs), e.g. "a deep learning framework for design of antiviral candidate drugs", "temperature increase can facilitate the destruction of SARS-CoV-2", "gpl16 antiserum blocks binding of virions to cellular receptors", and "food price inflation is an unintended consequence of COVID-19 containment measures".]
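The KB described above stores mechanism relations between entities and supports search over them. The sketch below is an assumed, simplified representation seeded with examples quoted in the abstract's figure; the field names and the naive substring search are illustrative, not the paper's schema or retrieval model.

```python
# Toy mechanism KB: directed relations between two entities/activities.
mechanisms = [
    {"subject": "deep learning",
     "object": "design of antiviral candidate drugs",
     "relation": "used-for"},
    {"subject": "temperature increase",
     "object": "destruction of SARS-CoV-2",
     "relation": "facilitates"},
    {"subject": "COVID-19 containment measures",
     "object": "food price inflation",
     "relation": "causes"},
]

def search(kb, term):
    # Naive retrieval: relations whose subject or object mentions the term.
    term = term.lower()
    return [m for m in kb
            if term in m["subject"].lower() or term in m["object"].lower()]

hits = search(mechanisms, "deep learning")
print(len(hits), hits[0]["relation"])  # -> 1 used-for
```

A real system would match queries against relations with a trained extraction and ranking model rather than substring search, which is what lets it outperform keyword-style retrieval in the clinical-expert study.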
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.