We present the Database of Disordered Protein Prediction (D2P2), available at http://d2p2.pro (including website source code). A battery of disorder predictors and their variants, VL-XT, VSL2b, PrDOS, PV2, Espritz and IUPred, were run on all protein sequences from 1765 complete proteomes (to be updated as more genomes are completed). Integrated with these results are all of the predicted (mostly structured) SCOP domains using the SUPERFAMILY predictor. These disorder/structure annotations together enable comparison of the disorder predictors with each other and examination of the overlap between disordered predictions and SCOP domains on a large scale. D2P2 will increase our understanding of the interplay between disorder and structure, the genomic distribution of disorder, and its evolutionary history. The parsed data are made available in a unified format for download as flat files or SQL tables either by genome, by predictor, or for the complete set. An interactive website provides a graphical view of each protein annotated with the SCOP domains and disordered regions from all predictors overlaid (or shown as a consensus). There are statistics and tools for browsing and comparing genomes and their disorder within the context of their position on the tree of life.
Background: Feature selection, which aims to identify the subset of features relevant for predicting a response from a possibly large set of features, is an important preprocessing step in machine learning. In gene expression studies this is not a trivial task for several reasons, including the potentially temporal character of the data. However, most feature selection approaches developed for microarray data cannot handle multivariate temporal data without prior data flattening, which results in loss of temporal information. We propose a temporal minimum redundancy - maximum relevance (TMRMR) feature selection approach, which is able to handle multivariate temporal data without prior data flattening. In the proposed approach we compute the relevance of a gene by averaging F-statistic values calculated across individual time steps, and we compute the redundancy between genes by using a dynamical time warping approach. Results: The proposed method is evaluated on three temporal gene expression datasets from human viral challenge studies. The results show that the proposed method outperforms alternatives widely used in gene expression studies. In particular, the proposed method improved accuracy in 34 out of 54 experiments, while the other methods outperformed it in no more than 4 experiments. Conclusion: We developed a filter-based feature selection method for temporal gene expression data based on maximum relevance and minimum redundancy criteria. The proposed method incorporates temporal information by combining relevance, which is calculated as an average F-statistic value across different time steps, with redundancy, which is calculated by employing a dynamical time warping approach.
As evident in our experiments, incorporating the temporal information into the feature selection process leads to selection of more discriminative features. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1423-9) contains supplementary material, which is available to authorized users.
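The relevance and redundancy criteria described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract only states that relevance is an average per-time-step F-statistic and that redundancy uses time warping, so several pieces here are assumptions: the data layout `data[sample][gene][time]`, the mapping from warping distance to a redundancy score (`1 / (1 + distance)`), the quotient form of the greedy criterion, and all function names.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of `values` grouped by class `labels`."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / (within + 1e-12)  # small constant avoids division by zero


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def relevance(data, labels, gene):
    """Average the F-statistic of one gene over all time steps."""
    n_steps = len(data[0][gene])
    return mean(
        f_statistic([sample[gene][t] for sample in data], labels)
        for t in range(n_steps)
    )


def redundancy(data, g1, g2):
    """Assumed mapping: similarity in (0, 1] from the mean DTW distance."""
    d = mean(dtw_distance(sample[g1], sample[g2]) for sample in data)
    return 1.0 / (1.0 + d)


def select_features(data, labels, n_select):
    """Greedy selection maximizing relevance / mean redundancy (assumed form)."""
    n_genes = len(data[0])
    rel = [relevance(data, labels, g) for g in range(n_genes)]
    selected = [max(range(n_genes), key=lambda g: rel[g])]
    while len(selected) < n_select:
        candidates = [g for g in range(n_genes) if g not in selected]
        best = max(
            candidates,
            key=lambda g: rel[g] / mean(redundancy(data, g, s) for s in selected),
        )
        selected.append(best)
    return selected
```

With this criterion, a gene that duplicates an already selected gene gains nothing from its high relevance, while a less relevant gene with a different temporal profile can be preferred.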
Background: Infant birth weight is a complex quantitative trait associated with both neonatal and long-term health outcomes. Numerous studies have been published in which candidate genes (IGF1, IGF2, IGF2R, IGF binding proteins, PHLDA2 and PLAGL1) have been associated with birth weight, but these studies are difficult to reproduce in man and large cohort studies are needed due to the large inter-individual variance in transcription levels. Also, very little of the trait variance is explained. We decided to identify additional candidates without regard for what is known about the genes. We hypothesize that DNA methylation differences between individuals can serve as markers of gene "expression potential" at growth-related genes throughout development and that these differences may correlate with birth weight better than single time point measures of gene expression. Methods: We performed DNA methylation and transcript profiling on cord blood and placenta from newborns. We then used novel computational approaches to identify genes correlated with birth weight. Results: We identified 23 genes whose methylation levels explain 70-87% of the variance in birth weight. Six of these (ANGPT4, APOE, CDK2, GRB10, OSBPL5 and REG1B) are associated with growth phenotypes in human or mouse models. Gene expression profiling explained a much smaller fraction of variance in birth weight than did DNA methylation. We further show that two genes, the transcriptional repressor MSX1 and the growth factor receptor adaptor protein GRB10, are correlated with transcriptional control of at least seven genes reported to be involved in fetal or placental growth, suggesting that we have identified important networks in growth control.
GRB10 methylation is also correlated with genes involved in reactive oxygen species signaling, stress signaling and oxygen sensing, and more recent data implicate GRB10 in insulin signaling. Conclusions: Single time point measurements of gene expression may reflect many factors unrelated to birth weight, while inter-individual differences in DNA methylation may represent a "molecular fossil record" of differences in birth weight-related gene expression. Finding these "unexpected" pathways may tell us something about the long-term association between low birth weight and adult disease, as well as which genes may be susceptible to environmental effects. These findings increase our understanding of the molecular mechanisms involved in human development and disease progression.
Background: Early classification of time series is beneficial for biomedical informatics problems including, but not limited to, disease change detection. Early classification can be of tremendous help by identifying the onset of a disease before it has time to fully take hold. In addition, extracting patterns from the original time series helps domain experts to gain insights into the classification results. This problem has been studied recently using time series segments called shapelets. In this paper, we present a method, which we call Multivariate Shapelets Detection (MSD), that allows for early and patient-specific classification of multivariate time series. The method extracts time series patterns, called multivariate shapelets, from all dimensions of the time series that distinctly manifest the target class locally. The time series are classified by searching for the earliest closest patterns. Results: The proposed early classification method for multivariate time series has been evaluated on eight gene expression datasets from viral infection and drug response studies in humans. In our experiments, the MSD method outperformed the baseline methods, achieving highly accurate classification by using as little as 40%-64% of the time series. The obtained results provide evidence that using conventional classification methods on short time series is not as accurate as using the proposed methods specialized for early classification. Conclusion: For the early classification task, we proposed a method called Multivariate Shapelets Detection (MSD), which extracts patterns from all dimensions of the time series. We showed that the MSD method can classify the time series early by using as little as 40%-64% of the time series’ length.
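The idea of classifying a series by its earliest closest pattern can be sketched as follows. This is only an illustration, not the MSD algorithm itself: the abstract does not describe how shapelets are extracted or matched, so the shapelet representation (one pattern and one distance threshold per dimension), the Euclidean distance, the tie-breaking by summed distance, and all names are assumptions. The shapelets are taken as given; learning them from training data is outside this sketch.

```python
import math


def window_distance(window, pattern):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((w - p) ** 2 for w, p in zip(window, pattern)))


def match_distance(series, end, shapelet):
    """Summed per-dimension distance of the window ending at `end` (exclusive).

    Returns None unless every dimension is within its own threshold.
    """
    length = len(shapelet["patterns"][0])
    start = end - length
    if start < 0:
        return None
    total = 0.0
    for dim, pattern, threshold in zip(
        series, shapelet["patterns"], shapelet["thresholds"]
    ):
        d = window_distance(dim[start:end], pattern)
        if d > threshold:
            return None
        total += d
    return total


def classify_early(series, shapelets):
    """Return (label, fraction_of_series_used); (None, 1.0) if nothing matches."""
    n = len(series[0])
    for t in range(1, n + 1):  # reveal the series one time point at a time
        hits = []
        for shapelet in shapelets:
            d = match_distance(series, t, shapelet)
            if d is not None:
                hits.append((d, shapelet["label"]))
        if hits:
            # Earliest time with a match; the closest shapelet decides the class.
            return min(hits, key=lambda hit: hit[0])[1], t / n
    return None, 1.0
```

The second element of the result corresponds to the fraction of the series consumed before a decision, which is the quantity the abstract reports as 40%-64%.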
Early classification of time series is prevalent in many time-sensitive applications such as, but not limited to, early warning of disease outcome and early warning of stock market crises. For example, early diagnosis allows physicians to design appropriate therapeutic strategies at early stages of diseases. However, practical adoption of early classification of time series requires an easy-to-understand explanation (interpretability) and a measure of confidence in the prediction results (uncertainty estimates). These two aspects were not jointly addressed in previous studies of early time series classification, which forced a difficult choice between them. In this study, we propose a simple yet effective method to provide uncertainty estimates for an interpretable early classification method. The question we address here is "how to provide estimates of uncertainty in regard to interpretable early prediction." In our extensive evaluation on twenty time series datasets we showed that the proposed method has several advantages over the state-of-the-art method that provides reliability estimates in early classification. Namely, the proposed method is more effective than the state-of-the-art method, is simple to implement, and provides interpretable results.
Leveraging temporal observations to predict a patient's health state at a future period is a very challenging task. Providing such a prediction early and accurately allows for designing a more successful treatment that starts before a disease completely develops. Information for this kind of early diagnosis could be extracted by use of temporal data mining methods for handling complex multivariate time series. However, physicians usually prefer to use interpretable models that can be easily explained, rather than relying on more complex black-box approaches. In this study, a temporal data mining method is proposed for extracting interpretable patterns from multivariate time series data, which can be used to assist in providing interpretable early diagnosis. The problem is formulated as an optimization-based binary classification task addressed in three steps. First, the time series data is transformed into a binary matrix representation suitable for application of classification methods. Second, a novel convex-concave optimization problem is defined to extract multivariate patterns from the constructed binary matrix. Then, a mixed integer discrete optimization formulation is provided to reduce the dimensionality and extract interpretable multivariate patterns. Finally, those interpretable multivariate patterns are used for early classification in challenging clinical applications. In the conducted experiments on two human viral infection datasets and a larger myocardial infarction dataset, the proposed method was more accurate and provided classifications earlier than three alternative state-of-the-art methods.
To combine prospective cohort studies, by including HLA harmonization, and estimate risk of islet autoimmunity and progression to clinical diabetes. RESEARCH DESIGN AND METHODS: For prospective cohorts in Finland, Germany, Sweden, and the U.S., 24,662 children at increased genetic risk for development of islet autoantibodies and type 1 diabetes have been followed. Following harmonization, the outcomes were analyzed in 16,709 infants-toddlers enrolled by age 2.5 years. RESULTS: In the infant-toddler cohort, 1,413 (8.5%) developed at least one autoantibody confirmed at two or more consecutive visits (seroconversion), 865 (5%) developed multiple autoantibodies, and 655 (4%) progressed to diabetes. The 15-year cumulative incidence of diabetes varied in children with one, two, or three autoantibodies at seroconversion: 45% (95% CI 40-52), 85% (78-90), and 92% (85-97), respectively. Among those with a single autoantibody, status 2 years after seroconversion predicted diabetes risk: 12% (10-25) if reverting to autoantibody negative, 30% (20-40) if retaining a single autoantibody, and 82% (80-95) if developing multiple autoantibodies. HLA-DR-DQ affected the risk of confirmed seroconversion and progression to diabetes in children with stable single-autoantibody status. Their 15-year diabetes incidence for higher- versus lower-risk genotypes was 40% (28-50) vs. 12%. The rate of progression to diabetes was inversely related to age at development of multiple autoantibodies, ranging from 20% per year to 6% per year in children developing multipositivity in ≤2 years or >7.4 years, respectively. CONCLUSIONS: The number of islet autoantibodies at seroconversion reliably predicts 15-year type 1 diabetes risk. In children retaining a single autoantibody, HLA-DR-DQ genotypes can further refine risk of progression.