Purpose: Cohort building is a powerful foundation for improving clinical care, performing research, recruiting for clinical trials, and many other applications. We set out to build a cohort of all patients with monogenic conditions who have received a definitive causal gene diagnosis in a 3-million-patient hospital system.
Methods: We define the subset of OMIM-curated diseases, half of the total (4,461), for which at least one monogenic causal gene is definitively known. We then introduce MonoMiner, a natural language processing framework that identifies molecularly confirmed monogenic patients from free-text clinical notes.
Results: We show that ICD-10-CM codes cover only a fraction of known monogenic diseases and that, even where codes are available, code-based patient retrieval achieves a precision of only 0.12. Searching by causal gene symbol offers high recall but an even lower precision of 0.09. MonoMiner achieves 7-9 times higher precision (0.82), with 0.88 precision on disease diagnosis alone, tagging 4,259 patients with 560 monogenic diseases and 534 causal genes, at 0.48 recall.
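For readers less familiar with the metrics, the following is a minimal sketch of how precision and recall are conventionally computed for a retrieved patient set, and of how the reported "7-9 times" figure follows from the stated precisions. The function name `precision_recall` and the toy patient identifiers are hypothetical and are not taken from MonoMiner; only the values 0.82, 0.12, and 0.09 come from the results above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved patient IDs
    against a gold-standard set of truly relevant patient IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy data: four patients retrieved, three truly relevant.
retrieved = {"p1", "p2", "p3", "p4"}
relevant = {"p1", "p2", "p5"}
p, r = precision_recall(retrieved, relevant)

# Fold improvement in precision, using the figures reported above.
fold_vs_codes = 0.82 / 0.12   # versus ICD-10-CM code retrieval
fold_vs_gene = 0.82 / 0.09    # versus causal gene symbol search
print(round(fold_vs_codes, 1), round(fold_vs_gene, 1))
```

The two ratios, roughly 6.8 and 9.1, correspond to the "7-9 times" range quoted in the results.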
Conclusion: MonoMiner enables the discovery of a large, high-precision cohort of monogenic disease patients with an established molecular diagnosis, empowering numerous downstream uses. Because it relies only on clinical notes, MonoMiner is highly portable, and its approach is adaptable to other domains and languages.