Deep catalogs of genetic variation collected from many thousands of humans enable the detection of intraspecies constraint by revealing coding regions with a scarcity of variation. While existing techniques summarize constraint for entire genes, single metrics cannot capture the fine-scale variability in constraint within each protein-coding gene. To provide greater resolution, we have created a detailed map of constrained coding regions (CCRs) in the human genome by leveraging coding variation observed among 123,136 humans from the Genome Aggregation Database (gnomAD). The most constrained coding regions in our map are enriched for both pathogenic variants in ClinVar and de novo mutations underlying developmental disorders. CCRs also reveal protein domain families under high constraint, suggest unannotated or incomplete protein domains, and facilitate the prioritization of previously unseen variation in studies of disease. Finally, a subset of CCRs with the highest constraint likely exist within genes that cause yet unobserved human phenotypes owing to strong purifying selection.
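The core intuition behind the CCR map can be illustrated with a toy sketch: given the coding positions of protein-changing variants observed in a large cohort, the stretches between consecutive variants are candidate constrained regions, ranked by length. This is a deliberate simplification; the published method additionally weights regions by sequencing coverage, CpG content, and other factors, and the positions below are invented for illustration.

```python
# Hypothetical sketch of the CCR idea: rank the variant-free gaps between
# consecutive observed protein-changing variants (toy positions, and without
# the coverage/CpG weighting used by the real model).

def constrained_regions(variant_positions):
    """Return (start, end, length) of gaps between consecutive variant
    positions, longest first."""
    pos = sorted(variant_positions)
    gaps = [(a + 1, b - 1, b - a - 1)
            for a, b in zip(pos, pos[1:]) if b - a > 1]
    return sorted(gaps, key=lambda g: -g[2])

regions = constrained_regions([100, 105, 180, 182, 400])
# the longest variant-free stretch lies between positions 182 and 400
```

In the full map, such gaps are converted to genome-wide percentiles, so the "highest-percentile CCRs" discussed above correspond to the longest, best-covered variant-free stretches.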
Coding variants in epigenetic regulators are emerging as causes of neurological dysfunction and cancer. However, a comprehensive effort to identify disease candidates within the human epigenetic machinery (EM) has not been performed; it is unclear whether features exist that distinguish between variation-intolerant and variation-tolerant EM genes, and between EM genes associated with neurological dysfunction versus cancer. Here, we rigorously define 295 genes with a direct role in epigenetic regulation (writers, erasers, remodelers, readers). Systematic exploration of these genes reveals that although individual enzymatic functions are always mutually exclusive, readers often also exhibit enzymatic activity (dual-function EM genes). We find that the majority of EM genes are highly intolerant to loss-of-function variation, even when compared to dosage-sensitive transcription factors, and we identify 102 novel EM disease candidates. We show that this variation intolerance is driven by the protein domains encoding the epigenetic function, suggesting that disease is caused by a perturbed chromatin state. We then describe a large subset of EM genes that are coexpressed within multiple tissues. This subset is almost exclusively populated by extremely variation-intolerant genes and shows enrichment for dual-function EM genes. It is also highly enriched for genes associated with neurological dysfunction, even when accounting for dosage sensitivity, but not for cancer-associated EM genes. Finally, we show that regulatory regions near epigenetic regulators are genetically important for common neurological traits. These findings prioritize novel disease candidate EM genes and suggest that this coexpression plays a functional role in normal neurological homeostasis.
SV-plaudit is a framework for rapidly curating structural variant (SV) predictions. For each SV, we generate an image that visualizes the coverage and alignment signals from a set of samples. Images are uploaded to our cloud framework where users assess the quality of each image using a client-side web application. Reports can then be generated as a tab-delimited file or annotated Variant Call Format (VCF) file. As a proof of principle, nine researchers collaborated for 1 hour to evaluate 1,350 SVs each. We anticipate that SV-plaudit will become a standard step in variant calling pipelines and the crowd-sourced curation of other biological results.
Code available at https://github.com/jbelyeu/SV-plaudit
Demonstration video available at https://www.youtube.com/watch?v=ono8kHMKxDs
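The reporting step above aggregates per-curator assessments into a single call per SV. A minimal sketch, assuming each curator assigns a categorical score per image (the label names here are illustrative, not SV-plaudit's actual vocabulary); the consensus label and vote counts could then be written to a tab-delimited report line or a VCF INFO tag.

```python
from collections import Counter

# Hypothetical sketch of consensus scoring across curators for one SV image.
def consensus(scores):
    """Return (majority label, votes for it, total votes)."""
    counts = Counter(scores)
    label, n = counts.most_common(1)[0]
    return label, n, len(scores)

label, n, total = consensus(["supports", "supports", "uncertain"])
# ("supports", 2, 3)
```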
Human Phenotype Ontology (HPO) terms are increasingly used in diagnostic settings to aid in the characterization of patient phenotypes. The HPO annotation database is updated frequently and can provide detailed phenotype knowledge on various human diseases, and many HPO terms are now mapped to candidate causal genes with binary relationships. To further improve the genetic diagnosis of rare diseases, we incorporated these HPO annotations, gene–disease databases and gene–gene databases in a probabilistic model to build a novel HPO-driven gene prioritization tool, Phen2Gene. Phen2Gene accesses a database built upon this information called the HPO2Gene Knowledgebase (H2GKB), which provides weighted and ranked gene lists for every HPO term. Phen2Gene is then able to access the H2GKB for patient-specific lists of HPO terms or PhenoPacket descriptions supported by GA4GH (http://phenopackets.org/), calculate a prioritized gene list based on a probabilistic model and output gene–disease relationships with great accuracy. Phen2Gene outperforms existing gene prioritization tools in speed and acts as a real-time phenotype-driven gene prioritization tool to aid the clinical diagnosis of rare undiagnosed diseases. In addition to a command line tool released under the MIT license (https://github.com/WGLab/Phen2Gene), we also developed a web server and web service (https://phen2gene.wglab.org/) for running the tool via web interface or RESTful API queries. Finally, we have curated a large amount of benchmarking data for phenotype-to-gene tools involving 197 patients across 76 scientific articles and 85 patients’ de-identified HPO term data from the Children’s Hospital of Philadelphia.
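The prioritization idea can be sketched in a few lines: the H2GKB maps every HPO term to a weighted gene list, and a patient's term set can be scored by accumulating per-term gene weights. The knowledgebase entries and weights below are invented toy data, and the real Phen2Gene model also weights terms by their specificity in the ontology; this is only an illustration of the ranking step.

```python
# Toy stand-in for the H2GKB: HPO term -> {gene: weight}. These entries and
# weights are hypothetical, not taken from the actual knowledgebase.
h2gkb = {
    "HP:0001250": {"SCN1A": 0.9, "KCNQ2": 0.7},   # Seizure
    "HP:0001263": {"SCN1A": 0.6, "MECP2": 0.8},   # Global developmental delay
}

def prioritize(patient_terms):
    """Accumulate per-term gene weights and rank genes by total score."""
    scores = {}
    for term in patient_terms:
        for gene, w in h2gkb.get(term, {}).items():
            scores[gene] = scores.get(gene, 0.0) + w
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = prioritize(["HP:0001250", "HP:0001263"])
# SCN1A ranks first (0.9 + 0.6 = 1.5)
```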
Background: Natural language processing (NLP) tools can facilitate the extraction of biomedical concepts from unstructured free texts, such as research articles or clinical notes. The NLP software tools CLAMP, cTAKES, and MetaMap are among the most widely used tools to extract biomedical concept entities. However, their performance in extracting disease-specific terminology from literature has not been compared extensively, especially for complex neuropsychiatric disorders with a diverse set of phenotypic and clinical manifestations. Methods: We comparatively evaluated these NLP tools using autism spectrum disorder (ASD) as a case study. We collected 827 ASD-related terms based on previous literature as the benchmark list for performance evaluation. Then, we applied CLAMP, cTAKES, and MetaMap on 544 full-text articles and 20,408 abstracts from PubMed to extract ASD-related terms. We evaluated the predictive performance using precision, recall, and F1 score. Results: We found that CLAMP has the best performance in terms of F1 score, followed by cTAKES and then MetaMap. Our results show that CLAMP has much higher precision than cTAKES and MetaMap, while cTAKES and MetaMap have higher recall than CLAMP. Conclusion: The analysis protocols used in this study can be applied to other neuropsychiatric or neurodevelopmental disorders that lack well-defined terminology sets to describe their phenotypic presentations.
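The evaluation described above reduces to set overlap between extracted terms and a benchmark term list. A minimal sketch of computing precision, recall, and F1 over that overlap (the term names are illustrative, not drawn from the study's 827-term benchmark):

```python
# Sketch of the evaluation: compare extracted terms against a benchmark list
# and compute precision, recall, and F1 over the set overlap.
def prf(extracted, benchmark):
    extracted, benchmark = set(extracted), set(benchmark)
    tp = len(extracted & benchmark)                      # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(benchmark) if benchmark else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

p, r, f1 = prf(["echolalia", "stereotypy", "insomnia"],
               ["echolalia", "stereotypy", "hand flapping", "toe walking"])
# precision = 2/3, recall = 2/4 = 0.5
```

This also makes the reported trade-off concrete: a tool that extracts fewer, more specific terms gains precision at the cost of recall, and F1 balances the two.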
Background: Clinical phenotype information greatly facilitates genetic diagnostic interpretation pipelines. While post-hoc extraction using natural language processing on unstructured clinical notes continues to improve, there is a need to improve point-of-care collection of patient phenotypes. Therefore, we developed "PheNominal", a point-of-care web application, embedded within Epic electronic health record (EHR) workflows, to permit capture of standardized phenotype data. Methods: Using bi-directional web services available within commercial EHRs, we developed a lightweight web application that allows users to rapidly browse and identify relevant terms from the Human Phenotype Ontology (HPO). Selected terms are saved discretely within the patient's EHR, permitting reuse both in clinical notes as well as in downstream diagnostic and research pipelines. Results: In the 16 months since implementation, PheNominal was used to capture discrete phenotype data for over 1500 individuals and 11,000 HPO terms during clinic and inpatient encounters for a genetic diagnostic consultation service within a quaternary-care pediatric academic medical center. An average of 7 HPO terms were captured per patient. Compared to a manual workflow, the average time to enter terms for a patient was reduced from 15 to 5 minutes, and there were fewer annotation errors. Conclusions: Modern EHRs support integration of external applications using application programming interfaces. We describe a practical application of these interfaces to facilitate deep phenotype capture in a discrete, structured format within a busy clinical workflow. Future versions will include a vendor-agnostic implementation using FHIR. We describe pilot efforts to integrate structured phenotyping through controlled dictionaries into diagnostic and research pipelines, reducing manual effort for phenotype documentation and reducing errors in data entry.
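The term-picker at the heart of this workflow is, at its simplest, a lookup over HPO term labels. A minimal sketch of that step, using a three-entry toy term table rather than the real ontology (PheNominal's actual search is richer, including synonym matching within the EHR-embedded interface):

```python
# Toy HPO term table: term ID -> label. Hypothetical subset, not the real HPO.
hpo = {
    "HP:0001250": "Seizure",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
}

def search(query):
    """Return term IDs whose label contains the typed substring."""
    q = query.lower()
    return sorted(tid for tid, label in hpo.items() if q in label.lower())

search("delay")  # -> ["HP:0001263"]
```

Saving the selected term IDs (rather than free text) is what makes the captured phenotypes reusable in downstream diagnostic and research pipelines.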