PubRunner: A light-weight framework for updating text mining results

López et al. 2018, PeerJ

A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies. In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology. We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation; the resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language. We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at http://biotea.github.io.

Section: Discussion

confidence: 99%

Biotea: semantics for Pubmed Central

López et al. 2018, PeerJ

“…We then used the PubRunner infrastructure to apply these two classifiers across all the aligned sentences [25]. This enabled the use of a compute cluster to quickly classify sentences as to whether they contain pharmacogenomic information. We then outputted relations along with the normalized form of the chemical and genes and other metadata.…”

Section: Methods

confidence: 99%

PGxMine: Text mining for curation of PharmGKB

et al. 2019

Self Cite

Precision medicine tailors treatment to individuals' personal data, including differences in their genome. The Pharmacogenomics Knowledgebase (PharmGKB) provides highly curated information on the effect of genetic variation on drug response and side effects for a wide range of drugs. PharmGKB’s scientific curators triage, review and annotate a large number of papers each year, but the task is challenging. We present the PGxMine resource, a text-mined resource of pharmacogenomic associations from all accessible published literature to assist in the curation of PharmGKB. We developed a supervised machine learning pipeline to extract associations between a variant (DNA and protein changes, star alleles and dbSNP identifiers) and a chemical. PGxMine covers 452 chemicals and 2,426 variants and contains 19,930 mentions of pharmacogenomic associations across 7,170 papers. An evaluation by PharmGKB curators found that 57 of the top 100 associations not found in PharmGKB led to 83 curatable papers and a further 24 associations would likely lead to curatable papers through citations. The results can be viewed at https://pgxmine.pharmgkb.org/ and code can be downloaded at https://github.com/jakelever/pgxmine.

Data and Information Management

“…Our approach is compatible with other text mining frameworks, such as PubRunner [27], for updating processed citations with the latest PubMed entries, and the many available text processing toolkits which can be used to process raw article metadata into processed feature sets, for example the NLTK (http://www.nltk.org/), the Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/), and Apache Open NLP (http://opennlp.apache.org/). The approach is also amenable to implementation on large scale parallel processing data analytic systems, such as Apache Spark (https://spark.apache.org/), which includes parallel implementations of several machine learning algorithms including SVM [28,29].…”

Section: Can This Framework Be Generalized To Other Biomedical Text M…

confidence: 96%

Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database

Smalheiser, Cohen 2018

Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and employ machine learning algorithms. At present, each research group tackles each problem from scratch, and in isolation from other projects, which causes redundancy and great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects, and can serve as a public repository for their outputs. We will initially focus on a specific goal, namely, classifying articles according to Publication Type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine-learning-based goals and projects, and can be used as a public platform for disseminating the results of NLP tools to end-users as well.