SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research*

Wu, Honghan; Toti, Giulia; Morley, Katherine I.; Ibrahim, Zina; Folarin, Amos; Jackson, Roy; Kartoglu, Ismail E.; Agrawal, Asha; Stringer, Clive; Gale, Darren; Gorrell, Genevieve; Roberts, Angus; Broadbent, Matthew; Stewart, Robert; Dobson, Richard

doi:10.1093/jamia/ocx160

Cited by 97 publications

(106 citation statements)

References 23 publications

(18 reference statements)

Supporting

Mentioning

102

Contrasting

“…al. [33]), many of which are made freely available and open source, have been intensively investigated in mining free-text medical records [10][34][35][36]. To provide guidance in the efficient reuse of pre-trained NLP models, we have here proposed an approach that can automatically (i) identify easy cases in a new task for the reused model, on which it can achieve good performance with high confidence; (ii) classify the remainder of the cases so that the validation or retraining on them can be conducted much more efficiently, compared to adapting the model on all cases.…”

Section: Principal Results (mentioning)

confidence: 99%

“…Automated approaches are essential to surface such deep data from free-text clinical notes at scale. To make NLP tools accessible for clinical applications, various approaches have been proposed, including generic, user-friendly tools [8][9][10] and web services or cloud based solutions [11][12][13]. Among these approaches, perhaps the most efficient way to facilitate clinical NLP projects is to adapt pre-trained NLP models in new but similar settings [14], i.e.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Reuse of Natural Language Processing Models for Phenotype-Mention Identification in Free-text Electronic Medical Records: A Phenotype Embedding Approach

Wu, Hodgson, Dyson et al. 2019

JMIR Med Inform

Self Cite

Background: Many efforts have been put into the use of automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records to construct comprehensive patient profiles for delivering better health-care. Reusing NLP models in new settings, however, remains cumbersome, requiring validation and/or retraining on new data iteratively to achieve convergent results. Objective: The aim of this work is to minimize the effort involved in reusing NLP models on free-text medical records. Methods: We formally define and analyse the model adaptation problem in phenotype-mention identification tasks. We identify "duplicate waste" and "imbalance waste", which collectively impede efficient model reuse. We propose a phenotype embedding based approach to minimize these sources of waste without the need for labelled data from new settings. Results: We conduct experiments on data from a large mental health registry to reuse NLP models in four phenotype-mention identification tasks. The proposed approach can choose the best model for a new task, identifying up to 76% (duplicate waste), i.e. phenotype mentions without the need for validation and model retraining, and with very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the selected model for novel language patterns in new tasks, saving around 80% (imbalance waste), i.e. the effort required in "blind" model-adaptation approaches. Conclusions: Adapting pre-trained NLP models for new tasks can be more efficient and effective if the language pattern landscapes of old settings and new settings can be made explicit and comparable. Our experiments show that the phenotype-mention embedding approach is an effective way to model language patterns for phenotype-mention identification tasks and that its use can guide efficient NLP model reuse.

“…in the UK cross-referencing against multiple EHR sources, prognostic validation and risk factor validation are all made possible by nationwide population-based records [28][29][30][31][32]. In contrast with the US, only recently have scalable methods been developed to access the entire hospital record for expert review [33] and text corpora are not available at scale [34]. There have been few previous studies [35] of the validity of International Classification of Disease and Health Related Problems, 10th Revision (ICD-10)…”

Section: Background and Significance (mentioning)

confidence: 99%

UK phenomics platform for developing and validating EHR phenotypes: CALIBER

Denaxas, González-Izquierdo, Direk et al. 2019

Preprint

Objective Electronic Health Records (EHR) are a rich source of information on human diseases, but the information is variably structured, fragmented, curated using different coding systems and collected for purposes other than medical research. We describe an approach for developing, validating and sharing reproducible phenotypes from national structured EHR in the United Kingdom (UK) with applications for translational research. Materials and Methods We implemented a rule-based phenotyping framework, with up to six approaches of validation. We applied our framework to a sample of 15 million individuals in a national EHR data source (population-based primary care, all ages) linked to hospitalization and death records in England. Data comprised continuous measurements e.g. blood pressure, medication information and coded diagnoses, symptoms, procedures and referrals, recorded using five controlled clinical terminologies: a) Read (primary care, subset of SNOMED-CT), b) International Classification of Diseases 9th/10th Revision (ICD-9, ICD-10, secondary care diagnoses and cause of mortality), c) OPCS Classification of Interventions and Procedures (OPCS-4, hospital surgical procedures), and d) DM+D prescription codes. Results Using the CALIBER phenotyping framework, we created algorithms for 51 diseases, syndromes, biomarkers and lifestyle risk factors and provide up to six validation approaches. The EHR phenotypes are curated in the open-access CALIBER Portal (https://www.caliberresearch.org/portal) and have been used by 40 national/international research groups in 60 peer-reviewed publications.

“…Another thread of work has focused on making querying easier to carry out, typically through development of natural language or other structured interfaces to the patient data [22][23][24][25]. Other approaches focus on normalizing semantic representation of patient data within the EHR itself [26] and applying deep learning to non-topical characteristics of studies and researchers [27]. A related area to cohort discovery is patient phenotyping, one of the goals of which is to identify patients for clinical studies [28][29][30].…”

Section: Introduction (mentioning)

confidence: 99%

Evaluation of Patient-Level Retrieval from Electronic Health Record Data for a Cohort Discovery Task

Bedrick, Cohen, Wang et al. 2019

Preprint

Objective Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well-understood. The objective of this research was to assess patient-level information retrieval (IR) methods using electronic health records (EHR) for different types of cohort definition retrieval. Materials and Methods We developed a test collection consisting of about 100,000 patient records and 56 test topics that characterized patient cohort requests for various clinical studies. Automated IR tasks using word-based approaches were performed, varying four different parameters for a total of 48 permutations, with performance measured using B-Pref. We subsequently created structured Boolean queries for the 56 topics for performance comparisons. In addition, we performed a more detailed analysis of 10 topics. Results The best-performing word-based automated query parameter settings achieved a mean B-Pref of 0.167 across all 56 topics. The way a topic was structured (topic representation) had the largest impact on performance. Performance not only varied widely across topics, but there was also a large variance in sensitivity to parameter settings across the topics. Structured queries generally performed better than automated queries on measures of recall and precision, but were still not able to recall all relevant patients found by the automated queries. Conclusion While word-based automated methods of cohort retrieval offer an attractive solution to the labor-intensive nature of this task currently used at many medical centers, we generally found suboptimal performance in those approaches, with better performance obtained from structured Boolean queries. Insights gained in this preliminary analysis will help guide future work to develop new methods for patient-level cohort discovery with EHR data.
