Genome-wide proximity ligation assays allow the identification of chromatin contacts at unprecedented resolution. Several studies have revealed that mammalian chromosomes are composed of topological domains (TDs) at sub-megabase resolution, which appear to be conserved across cell types and, to some extent, even between organisms. Identifying topological domains is now an important step toward understanding the structure and function of spatial genome organization. However, current methods for TD identification demand extensive computational resources, require careful tuning, and/or produce inconsistent results. In this work, we propose an efficient and deterministic method, TopDom, to identify TDs, along with a set of statistical methods for evaluating their quality. TopDom is much more efficient than existing methods and depends on just one intuitive parameter, a window size, for which we provide easy-to-implement optimization guidelines. TopDom also identifies more and higher-quality TDs than the popular directional index algorithm. The TDs identified by TopDom provide strong support for cross-tissue TD conservation. Finally, our analysis reveals that the locations of housekeeping genes are closely associated with cross-tissue conserved TDs. The software package and source code of TopDom are available at http://zhoulab.usc.edu/TopDom/.
Identification of unknown metabolites is a major challenge in metabolomics. Without the identities of the metabolites, the metabolome data generated from a biological sample cannot be readily linked with the proteomic and genomic information for studies in systems biology and medicine. We have developed a web-based metabolite identification tool ( http://www.mycompoundid.org ) that allows searching and interpreting mass spectrometry (MS) data against a newly constructed metabolome library composed of 8,021 known human endogenous metabolites and their predicted metabolic products (375,809 compounds from one metabolic reaction and 10,583,901 from two reactions). As an example, in the analysis of a simple extract of human urine or plasma and of whole human urine by liquid chromatography-mass spectrometry and MS/MS, we are able to identify at least twice as many metabolites in these samples as with a standard human metabolome library. In addition, the evidence-based metabolome library (EML) is shown to perform far better at identifying putative metabolites from a human urine sample than the ChemPub and KEGG libraries.
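To make the library search concrete, the following is a minimal sketch of an accurate-mass lookup against known metabolites plus their one-reaction products. It is an illustration under stated assumptions, not the MyCompoundID implementation: the three-entry library, the two reaction mass shifts, the 5 ppm tolerance, and the function name `mass_search` are all hypothetical choices made for this example. Only the monoisotopic masses themselves are standard values.

```python
# Monoisotopic neutral masses (Da) of three well-known human metabolites.
# A real library would hold thousands of entries; this one is illustrative.
LIBRARY = {
    "creatinine": 113.05891,      # C4H7N3O
    "hippuric acid": 179.05824,   # C9H9NO3
    "glucose": 180.06339,         # C6H12O6
}

# Hypothetical subset of metabolic reactions, expressed as mass shifts (Da).
REACTIONS = {
    "none": 0.0,
    "+O (hydroxylation)": 15.99491,
    "+CH2 (methylation)": 14.01565,
}


def mass_search(neutral_mass, tol_ppm=5.0):
    """Return (metabolite, reaction, error_ppm) candidates within tol_ppm.

    Every library entry is combined with every reaction shift, so the
    search covers known metabolites and their one-reaction products.
    """
    hits = []
    for name, base in LIBRARY.items():
        for reaction, shift in REACTIONS.items():
            candidate = base + shift
            error_ppm = (neutral_mass - candidate) / candidate * 1e6
            if abs(error_ppm) <= tol_ppm:
                hits.append((name, reaction, round(error_ppm, 2)))
    # Best (smallest absolute error) candidates first.
    return sorted(hits, key=lambda hit: abs(hit[2]))


if __name__ == "__main__":
    # A measured neutral mass that matches hydroxylated hippuric acid.
    for hit in mass_search(195.05316):
        print(hit)
```

Expanding the search space with reaction products raises the number of candidates per query, which is why a second filter such as an MS/MS spectral match is usually needed to narrow them down.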
Background: With the development of DNA sequencing technology, large amounts of sequencing data have become available in recent years and provide unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes, which may contribute to more accurate somatic point mutation based cancer classification (SMCC). However, in existing SMCC methods, issues such as high data sparsity, small sample sizes, and the application of simple linear classifiers are major obstacles to improving classification performance. Results: To address the obstacles in existing SMCC studies, we propose DeepGene, an advanced deep neural network (DNN) based classifier that consists of three steps: first, clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes; second, indexed sparsity reduction (ISR) converts the gene data into indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity; finally, the data after CGF and ISR are fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, a reformulated subset of the TCGA dataset containing 12 selected types of cancer, show that CGF, ISR and DNN all contribute to improving the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene achieves at least a 24% improvement in testing accuracy. Conclusions: Based on deep learning and somatic point mutation data, we devise DeepGene, an advanced cancer type classifier that addresses the obstacles in existing SMCC studies. Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which is mainly attributed to its deep learning module, which is able to extract high-level features relating combinatorial somatic point mutations to cancer types.
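The two preprocessing steps described above can be sketched in a few lines. This is a simplified illustration, not the DeepGene code: the frequency filter below keeps the top-ranked genes and omits the clustering part of CGF, and the function names, the 1-based indexing, the zero padding, and the toy mutation matrix are assumptions made for this example.

```python
def filter_genes_by_frequency(samples, keep):
    """Simplified stand-in for CGF: keep the `keep` most frequently mutated genes.

    `samples` is a list of 0/1 rows (one row per tumour, one column per gene).
    Returns the reduced matrix and the original column indexes that were kept.
    """
    n_genes = len(samples[0])
    counts = [sum(row[g] for row in samples) for g in range(n_genes)]
    # Rank by descending mutation count, breaking ties by column order.
    ranked = sorted(range(n_genes), key=lambda g: (-counts[g], g))
    kept = sorted(ranked[:keep])
    return [[row[g] for g in kept] for row in samples], kept


def indexed_sparsity_reduction(row, length):
    """Sketch of ISR: replace a sparse 0/1 vector by the indexes of its non-zeros.

    Indexes are 1-based so that 0 can serve as padding; rows with more than
    `length` non-zero entries are truncated.
    """
    indexes = [i + 1 for i, value in enumerate(row) if value != 0]
    return (indexes + [0] * length)[:length]


if __name__ == "__main__":
    mutations = [
        [0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 0],
    ]
    filtered, kept = filter_genes_by_frequency(mutations, keep=3)
    print(kept)      # [0, 1, 4]
    encoded = [indexed_sparsity_reduction(row, length=3) for row in filtered]
    print(encoded)   # [[2, 3, 0], [2, 0, 0], [1, 2, 3]]
```

The encoded rows are short and dense, so a downstream classifier sees a compact input instead of a long vector that is almost entirely zeros.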
We report an analytical tool to facilitate metabolite identification based on an MS/MS spectral match of an unknown to a library of predicted MS/MS spectra of possible human metabolites. To construct the spectral library, the known endogenous human metabolites in the Human Metabolome Database (HMDB) (8,021 metabolites) and their predicted metabolic products via one metabolic reaction in the Evidence-based Metabolome Library (EML) (375,809 predicted metabolites) were subjected to in silico fragmentation to produce the predicted MS/MS spectra. This spectral library is hosted at the public MCID Web site ( www.MyCompoundID.org ), and a spectral search program, MCID MS/MS, has been developed to allow a user to search one or a batch of experimental MS/MS spectra against the library spectra for possible matches. Using MS/MS spectra generated from standard metabolites and a human urine sample, we demonstrate that this tool is useful for putative metabolite identification. It allows a user to narrow down the many possible structures initially found by an accurate mass search of an unknown metabolite to only one or a few candidates, thereby saving time and effort in selecting or synthesizing metabolite standard(s) for eventual positive metabolite identification.
Background: The differentiation and maturation trajectories of fetal liver stem/progenitor cells (LSPCs) are not fully understood at single-cell resolution, and a priori knowledge of limited biomarkers could restrict trajectory tracking. Results: We employed marker-free single-cell RNA-Seq to characterize comprehensive transcriptional profiles of 507 cells randomly selected from seven stages between embryonic day 11.5 and postnatal day 2.5 during mouse liver development, and also 52 Epcam-positive cholangiocytes from postnatal day 3.25 mouse livers. LSPCs in developing mouse livers were identified via marker-free transcriptomic profiling. Single-cell resolution dynamic developmental trajectories of LSPCs exhibited contiguous but discrete genetic control through transcription factors and signaling pathways. The gene expression profiles of cholangiocytes were closer to those of embryonic day 11.5 LSPCs than to those of later-stage LSPCs, hinting at the stage at which the fate of LSPCs is decided. Our marker-free approach also allows systematic assessment and prediction of isolation biomarkers for LSPCs. Conclusions: Our data provide not only a valuable resource but also novel insights into the fate decision and transcriptional control of self-renewal, differentiation and maturation of LSPCs. Electronic supplementary material: The online version of this article (10.1186/s12864-017-4342-x) contains supplementary material, which is available to authorized users.
Context: Accurate methods for predicting gestational diabetes mellitus (GDM) early, during the first trimester of pregnancy, are lacking in Chinese and other populations. Objectives: This work aimed to establish effective models to predict early GDM. Methods: Pregnancy data for 73 variables during the first trimester were extracted from the electronic medical record system. Based on a machine learning (ML)-driven feature selection method, 17 variables were selected for early GDM prediction. To facilitate clinical application, 7 variables were selected from the 17-variable panel. Advanced ML approaches were then applied to the 7-variable data set and the 73-variable data set to build models predicting early GDM for different situations. Results: A total of 16 819 and 14 992 cases were included in the training and testing sets, respectively. Using 73 variables, the deep neural network model achieved high discriminative power, with an area under the curve (AUC) of 0.80. The 7-variable logistic regression (LR) model also achieved effective discriminative power (AUC = 0.77). Low body mass index (BMI) (≤ 17) was related to an increased risk of GDM, compared to a BMI in the range of 17 to 18 (minimum risk interval) (11.8% vs 8.7%, P = .09). Total 3,3′,5-triiodothyronine (T3) and total thyroxine (T4) were superior to free T3 and free T4 in predicting GDM. Lipoprotein(a) demonstrated promising predictive value (AUC = 0.66). Conclusions: We employed ML models that achieved high accuracy in predicting GDM in early pregnancy. A clinically cost-effective 7-variable LR model was simultaneously developed. The relationship of GDM with thyroxine and BMI was investigated in the Chinese population.
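The AUC values reported above can be read as the probability that a randomly chosen GDM case receives a higher predicted risk than a randomly chosen non-case. The sketch below computes AUC directly from that pairwise definition. It is a generic illustration rather than the study's evaluation code; the function name `roc_auc` and the four-sample toy data are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from its pairwise (rank) definition.

    `labels` holds 1 for cases and 0 for non-cases; `scores` holds the
    model's predicted risk for each sample. Tied scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one non-case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))  # 0.75
```

The double loop is quadratic in the number of samples, which is fine for a demonstration; with tens of thousands of cases, a rank-based implementation would be the more practical choice.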