Background
The increasing availability of Electronic Health Record (EHR) data, and specifically free-text patient notes, presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entity mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models, or does it confer benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining?

Results
We analyze a large-scale EHR corpus and quantify redundancy in terms of both word and semantic-concept repetition. We observe redundancy levels of about 30% and non-standard distributions of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; and (ii) removing redundant notes with an efficient fingerprinting-based algorithm.
For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results.

Conclusions
Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has long been recognized for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage the available data in the EHR corpus while avoiding the bias introduced by redundancy.
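The fingerprinting idea above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it hashes word k-shingles of each note and scores a new note by the fraction of its shingles already seen in prior notes of the same record (the note texts, the helper names, and the threshold choice are all invented for the example).

```python
import hashlib

def shingle_fingerprints(text, k=8):
    """Hash each k-word shingle; the set of hashes is the note's fingerprint."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(1, len(words) - k + 1))
    }

def redundancy(new_note, prior_notes, k=8):
    """Fraction of the new note's shingles already present in prior notes.

    A note scoring above some threshold (e.g. 0.8) could be dropped
    before text mining, keeping only novel content in the corpus.
    """
    seen = set()
    for note in prior_notes:
        seen |= shingle_fingerprints(note, k)
    fingerprint = shingle_fingerprints(new_note, k)
    return len(fingerprint & seen) / len(fingerprint)
```

A fully copied note scores 1.0, a note with no 8-word overlap scores 0.0, and a partially pasted note falls in between, which is what makes a score threshold usable as a filter.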
Age is an important factor when considering phenotypic changes in health and disease. Currently, the use of age information in medicine is somewhat simplistic, with ages commonly being grouped into a small number of crude ranges reflecting the major stages of development and aging, such as childhood or adolescence. Here, we investigate the possibility of redefining age groups using the recently developed Age-Phenome Knowledge-base (APK), which holds over 35,000 literature-derived entries describing relationships between age and phenotype. Clustering of APK data suggests 13 new, partially overlapping, age groups. The diseases that define these groups suggest that the proposed divisions are biologically meaningful. We further show that the number of different age ranges that should be considered depends on the type of disease being evaluated. This finding was further strengthened by similar results obtained from clinical blood measurement data. The grouping of diseases that share a similar pattern of disease-related reports directly mirrors, in some cases, medical knowledge of disease-age relationships. In other cases, our results may be used to generate new and reasonable hypotheses regarding links between diseases.
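The kind of clustering described above can be sketched on a toy disease-by-age-bin matrix. This is only an illustration of the general approach, assuming per-disease report counts across age bins; the matrix, bin edges, and cluster count below are invented, and the real APK analysis is far larger.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy disease-by-age-bin report counts (columns: 0-9, 10-19, ..., 70-79).
# Rows are hypothetical diseases with early- vs late-life report profiles.
counts = np.array([
    [30, 5, 1, 0, 0, 0, 0, 0],      # early-onset profile
    [25, 8, 2, 1, 0, 0, 0, 0],      # similar early-onset profile
    [0, 1, 3, 8, 15, 20, 25, 30],   # late-onset chronic profile
    [0, 0, 2, 6, 14, 22, 24, 28],   # similar late-onset profile
], dtype=float)

# Normalize each disease's profile so clustering reflects shape, not volume.
profiles = counts / counts.sum(axis=1, keepdims=True)

# Average-linkage hierarchical clustering on correlation distance;
# diseases with similar age-report patterns end up in the same cluster.
Z = linkage(profiles, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the dendrogram at different heights yields different numbers of age-pattern groups, which mirrors the paper's observation that the appropriate number of age ranges depends on the diseases under consideration.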
The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining, and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling the content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation, carried out through log-likelihood on held-out data and topic coherence of the produced topics, and qualitative assessment of topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
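Baselines (i) and (ii) above can be sketched with an off-the-shelf LDA implementation. This is not Red-LDA itself, which the paper defines; it only contrasts fitting LDA on all notes (redundancy included) against fitting on one representative note per record, using scikit-learn and invented toy "records".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "patient records": each record is a list of notes, with copy-paste
# redundancy simulated by repeating earlier note text (content is invented).
records = [
    ["patient with chest pain troponin negative",
     "patient with chest pain troponin negative ecg unremarkable"],
    ["fever and cough started azithromycin",
     "fever and cough started azithromycin xray shows infiltrate"],
]

def fit_lda(docs, n_topics=2):
    """Fit vanilla LDA on a bag-of-words representation of the documents."""
    X = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit(X)

# Baseline (i): vanilla LDA over every note, redundancy included.
model_all = fit_lda([note for record in records for note in record])

# Baseline (ii): one representative note per record (here, the last one).
model_dedup = fit_lda([record[-1] for record in records])
```

On real records, the copied passages inflate the counts seen by baseline (i), biasing its topics toward boilerplate; baseline (ii) avoids this at the cost of discarding data, which is the tension Red-LDA is designed to resolve.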
Using an available cross-species cDNA microarray is advantageous for examining multigene expression patterns in non-model organisms, obviating the need to construct species-specific arrays. The aim of the present study was to estimate the relative efficiency of cross-species hybridizations across bony fishes using bioinformatics tools. The methodology may also serve as a model for similar evaluations in other taxa. The theoretical evaluation was done by substituting comparative whole-transcriptome sequence similarity information into the thermodynamic hybridization equation. Complementary DNA sequence assemblages of nine fish species belonging to common families or suborders and distributed across the bony fish taxonomic branch were selected for transcriptome-wise comparisons. Actual cross-species hybridizations among fish of different taxonomic distances were used to validate and eventually to calibrate the theoretically computed relative efficiencies.
Traditional lectures have limited ability to maintain attention and to promote changes in behaviour. Active learning, which stimulates the audience to think and participate, may be more effective. We describe our experience with an interactive polling system in lectures to physicians and students. The audience's answers to questions are displayed, providing instant feedback to both lecturer and audience, and promoting the use of case discussions and problem-solving exercises. In our experience, this modality improves the quality of clinical learning and deserves further evaluation.
Several methods have been proposed for detecting insertion/deletions (indels) from chromatograms generated by Sanger sequencing. However, most such methods are unsuitable when the mutated and normal variants occur at unequal ratios, such as is expected to be the case in cancer, with organellar DNA or with alternatively spliced RNAs. In addition, the current methods do not provide robust estimates of the statistical confidence of their results, and the sensitivity of this approach has not been rigorously evaluated. Here, we present CHILD, a tool specifically designed for indel detection in mixtures where one variant is rare. CHILD makes use of standard sequence alignment statistics to evaluate the significance of the results. The sensitivity of CHILD was tested by sequencing controlled mixtures of deleted and undeleted plasmids at various ratios. Our results indicate that CHILD can identify deleted molecules present as just 5% of the mixture. Notably, the results were plasmid/primer-specific; for some primers and/or plasmids, the deleted molecule was only detected when it comprised 10% or more of the mixture. The false positive rate was estimated to be lower than 0.4%. CHILD was implemented as a user-oriented web site, providing a sensitive and experimentally validated method for the detection of rare indel-carrying molecules in common Sanger sequence reads.
The identification of genomic loci associated with human genetic syndromes has been significantly facilitated by the generation of high-density SNP arrays. However, optimal selection of candidate genes from within such loci remains a tedious, labor-intensive bottleneck. Syndrome to Gene (S2G) is based on novel algorithms that allow an efficient search for candidate genes in a genomic locus, using known genes whose defects cause phenotypically similar syndromes. S2G (http://fohs.bgu.ac.il/s2g/index.html) includes two components. The first is a phenotype-based Online Mendelian Inheritance in Man (OMIM) search engine that alleviates many of the problems in the existing OMIM search engine (negation phrases, overlapping terms, etc.). The second is a gene-prioritizing engine that uses a novel algorithm to integrate information from 18 databases. When the detailed phenotype of a syndrome is entered into the web-based software, S2G offers a complete, improved search of the OMIM database for similar syndromes. The software then prioritizes a list of genes from within a genomic locus, based on their association with genes whose defects are known to underlie similar clinical syndromes. We demonstrate that in all 30 cases of novel disease genes identified in the past year, the disease gene was within the top 20% of candidate genes predicted by S2G, and in most cases within the top 10%. Thus, S2G provides clinicians with an efficient tool for diagnosis and researchers with a candidate gene prediction tool based on phenotypic data and a wide range of gene data resources. S2G can also serve in studies of polygenic diseases, and in finding interacting molecules for any gene of choice.