The scientific literature is vast, growing, and increasingly specialized, making it difficult to connect disparate observations across subfields. To address this problem, we sought to develop automated hypothesis generation by networking at scale the MeSH terms curated by the National Library of Medicine. The result is a MeSH Term Objective Reasoning (MeTeOR) approach that tallies associations among genes, drugs and diseases from PubMed and predicts new ones. Comparisons to reference databases and algorithms show MeTeOR tends to be more reliable. We also show that many predictions based on the literature prior to 2014 were published subsequently. In a practical application, we experimentally validated a surprising new association predicted by MeTeOR between Epidermal Growth Factor Receptor (EGFR) and CDK2. We conclude that MeTeOR generates useful hypotheses from the literature (http://meteor.lichtargelab.org/).
AUTHOR SUMMARY
The large size and exponential expansion of the scientific literature form a bottleneck to accessing and understanding published findings. Manual curation and Natural Language Processing (NLP) aim to address this bottleneck by summarizing and disseminating the knowledge within articles as key relationships (e.g. TP53 relates to Cancer). However, these methods compromise on either coverage or accuracy, respectively. To mitigate this compromise, we proposed using manually assigned keywords (MeSH terms) to extract relationships from the publications and demonstrated comparable coverage but higher accuracy than current NLP methods. Furthermore, we combined the extracted knowledge with semi-supervised machine learning to create hypotheses to guide future work and discovered a direct interaction between two important cancer genes.

INTRODUCTION
It is difficult to keep abreast of new publications. Currently, PubMed contains over 28 million papers (http://www.ncbi.nlm.nih.gov/pubmed), 3 million more than three years ago. This steady accumulation of findings gives rise to a large number of latent connections that Literature-Based Discovery (LBD) seeks to systematically recognize and integrate [1]; a classic example is Swanson's original finding linking fish oil to the treatment of Raynaud's disease [2]. Since this original analysis, LBD has been extensively replicated, automated and expanded [3-10], leading to new patterns of inference (e.g. locating opposing actions of a disease and a drug on given physiological functions [11]) and to new discoveries [12]. Successes include the automated discovery of protein functions [13, 14] and of the genetic bases of disease [15, 16], as well as the stratification of patient phenotypes [17] and outcomes [18].
A limitation of LBD, however, is its dependence on knowledge extraction. It relies either on human curation, which is not scalable, or on comprehensive text-mining, for which algorithms are less accurate [19, 20]. One of the largest curated m...