With the increasing popularity of scientific workflows, public repositories are gaining importance as a means to share, find, and reuse such workflows. As the sizes of these repositories grow, methods to compare the scientific workflows stored in them become a necessity, for instance, to allow duplicate detection or similarity search. Scientific workflows are complex objects, and their comparison entails a number of distinct steps, from comparing atomic elements to comparing the workflows as a whole. Various studies have implemented methods for scientific workflow comparison and have reached often contradictory conclusions about which algorithms work best. Comparing these results is cumbersome, as the original studies mixed different approaches for different steps and used different evaluation data and metrics. We contribute to the field (i) by dissecting each previous approach into an explicitly defined and comparable set of subtasks, (ii) by comparing in isolation different approaches taken at each step of scientific workflow comparison, reporting on a number of unexpected findings, (iii) by investigating how these can best be combined into aggregated measures, and (iv) by making available a gold standard of over 2000 similarity ratings contributed by 15 workflow experts on a corpus of almost 1500 workflows and re-implementations of all methods we evaluated.
Research results are primarily published in scientific literature, and curation efforts cannot keep up with the rapid growth of published literature. A plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces search results that are too large and unspecific. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein–protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271 000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.
One of the greatest strengths of artificial intelligence (AI) and machine learning (ML) approaches in health care is that their performance can be continually improved based on updates from automated learning from data. However, health care ML models are currently regulated essentially under provisions that were developed for an earlier age of slowly updated medical devices, requiring substantial documentation rework and revalidation with every major update of the model generated by the ML algorithm. This creates minor problems for models that will be retrained and updated only occasionally, but major problems for models that will learn from data in real time or near real time. Regulators have announced action plans for fundamental changes in regulatory approaches. In this Viewpoint, we examine the current regulatory frameworks and developments in this domain. The status quo and recent developments are reviewed, and we argue that these innovative approaches to health care need matching innovative approaches to regulation and that these approaches will bring benefits for patients. International perspectives from the World Health Organization, and the Food and Drug Administration’s proposed approach, based around oversight of tool developers’ quality management systems and defined algorithm change protocols, offer a much-needed paradigm shift, and strive for a balanced approach to enabling rapid improvements in health care through AI innovation while simultaneously ensuring patient safety. The draft European Union (EU) regulatory framework indicates similar approaches, but no detail has yet been provided on how algorithm change protocols will be implemented in the EU. We argue that detail must be provided, and we describe how this could be done in a manner that would allow the full benefits of AI/ML-based innovation for EU patients and health care systems to be realized.
PURPOSE Precision oncology depends on the availability of up-to-date, comprehensive, and accurate information about associations between genetic variants and therapeutic options. Recently, a number of knowledge bases (KBs) have been developed that gather such information on the basis of expert curation of the scientific literature. We performed a quantitative and qualitative comparison of Clinical Interpretations of Variants in Cancer, OncoKB, Cancer Gene Census, Database of Curated Mutations, CGI Biomarkers (the cancer genome interpreter biomarker database), Tumor Alterations Relevant for Genomics-Driven Therapy, and the Precision Medicine Knowledge Base. METHODS We downloaded each KB and restructured its content to describe variants, genes, drugs, and gene-drug associations in a common format. We normalized gene names to Entrez Gene IDs and drug names to ChEMBL and DrugBank IDs. For the analysis of clinically relevant gene-drug associations, we obtained lists of genes affected by genetic alterations and putative drug therapies for 113 patients with cancer whose cases were presented at the Molecular Tumor Board (MTB) of the Charité Comprehensive Cancer Center. RESULTS Our analysis revealed that the KBs are largely overlapping but also that each source harbors a notable amount of unique information. Although some KBs cover more genes, others contain more data about gene-drug associations. Retrospective comparisons with findings of the Charité MTB at the gene level showed that use of multiple KBs may considerably improve retrieval results. The relative importance of a KB in terms of cancer genes was assessed in more detail by logistic regression, which revealed that all but one source had a notable impact on result quality. We confirmed these findings using a second data set obtained from an independent MTB.
CONCLUSION To date, none of the existing publicly available KBs on gene-drug associations in precision oncology fully subsumes the others, but all of them exhibit specific strengths and weaknesses. Consideration of multiple KBs, therefore, is essential to obtain comprehensive results.
Objective We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnoses, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on information extraction (IE) from German medical texts. Materials and Methods BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. Results The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as a held-out data set for unbiased evaluation of future IE tools. On this held-out data set, our baselines reach, depending on the specific entity type, F1-scores of 0.72–0.90 for named entity recognition, 0.10–0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. Discussion Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. Conclusion To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.