Here, we propose a heuristic data trimming technique for SVM termed FLOating Window Projective Separator (FloWPS), tailored for personalized predictions based on molecular data. This procedure can operate on high-throughput genetic datasets such as gene expression or mutation profiles. Its application prevents SVM from extrapolating by excluding non-informative features. FloWPS requires training on data from individuals with known clinical outcomes to create a clinically relevant classifier. The genetic profiles linked with the outcomes are split, as usual, into training and validation datasets. The unique property of FloWPS is that irrelevant features in the validation dataset that do not have a significant number of neighboring hits in the training dataset are removed from further analysis. Next, similarly to the k nearest neighbors (kNN) method, for each point of the validation dataset, FloWPS takes into account only the proximal points of the training dataset. Thus, for every point of the validation dataset, the training dataset is adjusted to form a floating window. FloWPS performance was tested on ten gene expression datasets for 992 cancer patients who either responded or did not respond to different types of chemotherapy. We confirmed experimentally by leave-one-out cross-validation that FloWPS significantly increases the quality of a classifier built on the classical SVM in most of the applications, particularly for polynomial kernels.
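The two trimming steps described in the abstract above (dropping features without enough neighboring hits, then keeping only proximal training points) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the parameters `m` and `k`, and the reading of "neighboring hits" as at least `m` training values on each side of the validation value are assumptions made for illustration. The SVM training step that would follow is omitted.

```python
import math

def trim_features(train, point, m):
    """Return indices of features kept for this validation point.

    Assumption: a feature is kept when at least m training values lie at or
    below the point's value and at least m lie at or above it, so the
    classifier would not have to extrapolate along that feature.
    """
    kept = []
    for j, value in enumerate(point):
        below = sum(1 for row in train if row[j] <= value)
        above = sum(1 for row in train if row[j] >= value)
        if below >= m and above >= m:
            kept.append(j)
    return kept

def floating_window(train, point, m, k):
    """Return (kept feature indices, indices of the k nearest training rows).

    Distances are Euclidean and computed only over the kept features.
    """
    features = trim_features(train, point, m)

    def dist(row):
        return math.sqrt(sum((row[j] - point[j]) ** 2 for j in features))

    order = sorted(range(len(train)), key=lambda i: dist(train[i]))
    return features, order[:k]

# Hypothetical toy data: four training profiles with two features each.
train = [[1, 10], [2, 20], [3, 30], [4, 40]]
point = [2.5, 50]  # the second feature lies outside the training range

features, window = floating_window(train, point, m=2, k=2)
print(features, window)
```

In this toy case the second feature is dropped because no training value lies above 50, and the window consists of the two training profiles closest to the validation point along the remaining feature.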
Personalized medicine implies that distinct treatment methods are prescribed to individual patients according to several features that may be obtained from, e.g., the gene expression profile. The majority of machine learning methods suffer from a shortage of preceding cases, i.e. gene expression data on patients combined with the confirmed outcome of known treatment methods. At the same time, there exist thousands of cell lines that were treated with hundreds of anti-cancer drugs in order to check the ability of these drugs to stop cell proliferation, and all these cell line cultures were profiled in terms of their gene expression. Here we present a new approach in machine learning, which can predict the clinical efficiency of anti-cancer drugs for individual patients by transferring features obtained from the expression-based data from cell lines. The method was validated on three datasets for cancer-like diseases (chronic myeloid leukemia, as well as lung adenocarcinoma and renal carcinoma) treated with targeted drugs (kinase inhibitors such as imatinib or sorafenib).
Endogenous retroviruses and retrotransposons, also termed retroelements (REs), are mobile genetic elements that were active until recently in human genome evolution. REs regulate gene expression by actively reshaping chromatin structure or by directly providing transcription factor binding sites (TFBSs). We aimed to identify molecular processes most deeply impacted by the REs in human cells at the level of TFBS regulation. By using ENCODE data, we identified ~2 million TFBS overlapping with putatively regulation-competent human REs located in 5-kb gene promoter neighborhood (~17% of all TFBS in promoter neighborhoods; ~9% of all RE-linked TFBS). Most REs hosting TFBS were highly diverged repeats, and for the evolutionarily young (0–8% diverged) elements we identified only ~7% of all RE-linked TFBS. The gene-specific distributions of RE-linked TFBS generally correlated with the distributions for all TFBS. However, several groups of molecular processes were highly enriched in the RE-linked TFBS regulation. They were strongly connected with immunity and response to pathogens, with the negative regulation of gene transcription, ubiquitination, and protein degradation, extracellular matrix organization, regulation of STAT signaling, fatty acid metabolism, regulation of GTPase activity, protein targeting to Golgi, regulation of cell division and differentiation, and development and functioning of perception organs and the reproductive system. By contrast, the processes most weakly affected by the REs were linked with the conservative aspects of embryo development. We also identified differences in the regulation features by the younger and older fractions of the REs. The regulation by the older fraction of the REs was linked mainly with immunity, cell adhesion, cAMP, IGF1R, Notch, Wnt, and integrin signaling, neuronal development, chondroitin sulfate and heparin metabolism, and endocytosis. The younger REs regulate other aspects of immunity, cell cycle progression and apoptosis, PDGF, TGF beta, EGFR, and p38 signaling, transcriptional repression, structure of nuclear lumen, catabolism of phospholipids and heterocyclic molecules, insulin and AMPK signaling, retrograde Golgi-ER transport, and estrogen signaling. The immunity-linked pathways were highly represented in both categories, but their functional roles were different and did not overlap. Our results point to the most quickly evolving molecular pathways in the recent and ancient evolution of the human genome.
Background: Harmonization techniques make different gene expression profiles and their sets compatible and ready for comparisons. Here we present a new bioinformatic tool termed Shambhala for harmonization of multiple human gene expression datasets obtained using different experimental methods and platforms of microarray hybridization and RNA sequencing. Results: Unlike previously published methods enabling good quality data harmonization for only two datasets, Shambhala allows conversion of multiple datasets into the universal form suitable for further comparisons. Shambhala harmonization is based on the calibration of gene expression profiles using the auxiliary standardization dataset. Each profile is transformed to make it similar to the output of the microarray hybridization platform Affymetrix Human Gene. This platform was chosen because it has the biggest number of human gene expression profiles deposited in public databases. We evaluated Shambhala's ability to retain biologically important features after harmonization. The same four biological samples taken in multiple replicates were profiled independently using three and four different experimental platforms, respectively, then Shambhala-harmonized and investigated by hierarchical clustering. Conclusion: Our results showed that, unlike other frequently used methods (quantile normalization and DESeq/DESeq2 normalization), Shambhala harmonization was the only method supporting sample-specific and platform-independent biologically meaningful clustering for the data obtained from multiple experimental platforms. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2641-8) contains supplementary material, which is available to authorized users.
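For context, quantile normalization, one of the baseline methods named in the comparison above, can be sketched as follows. This illustrates the baseline only, not Shambhala, whose algorithm is not described in the abstract. The function name and the toy data are hypothetical, and ties are broken by gene order rather than averaged, which is a simplifying assumption.

```python
def quantile_normalize(profiles):
    """Quantile-normalize expression profiles over the same genes.

    profiles: list of samples, each a list of expression values with genes
    in the same order. Every sample ends up with the same value distribution:
    the mean of the sorted values across samples at each rank.
    """
    n_genes = len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # Reference distribution: rank-wise mean across all samples.
    reference = [
        sum(sp[rank] for sp in sorted_profiles) / len(profiles)
        for rank in range(n_genes)
    ]
    normalized_profiles = []
    for p in profiles:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda g: p[g])
        normalized = [0.0] * n_genes
        for rank, gene in enumerate(order):
            normalized[gene] = reference[rank]
        normalized_profiles.append(normalized)
    return normalized_profiles

# Hypothetical toy data: two samples, three genes.
profiles = [[5, 2, 3], [4, 1, 6]]
print(quantile_normalize(profiles))
```

After normalization both samples contain the same set of values and differ only in which gene holds which rank.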
High throughput technologies opened a new era in biomedicine by enabling massive analysis of gene expression at both RNA and protein levels. Unfortunately, expression data obtained in different experiments are often poorly compatible, even for the same biologic samples. Here, using experimental and bioinformatic investigation of major experimental platforms, we show that aggregation of gene expression data at the level of molecular pathways helps to diminish cross- and intra-platform bias otherwise clearly seen at the level of individual genes. We created a mathematical model of cumulative suppression of data variation that predicts the ideal parameters and the optimal size of a molecular pathway. We compared the abilities to aggregate experimental molecular data for the 5 alternative methods, also evaluated by their capacity to retain meaningful features of biologic samples. The bioinformatic method OncoFinder showed optimal performance in both tests and should be very useful for future cross-platform data analyses.
Background: Transposons are selfish genetic elements that self-reproduce in host DNA. They were active during evolutionary history and now occupy almost half of mammalian genomes. Close insertions of transposons reshaped structure and regulation of many genes considerably. Co-evolution of transposons and host DNA frequently results in the formation of new regulatory regions. Previously we published a concept that the proportion of functional features held by transposons positively correlates with the rate of regulatory evolution of the respective genes. Methods: We ranked human genes and molecular pathways according to their regulatory evolution rates based on high throughput genome-wide data on five histone modifications (H3K4me3, H3K9ac, H3K27ac, H3K27me3, H3K9me3) linked with transposons for five human cell lines. Results: Based on the total of approximately 1.5 million histone tags, we ranked regulatory evolution rates for 25075 human genes and 3121 molecular pathways and identified groups of molecular processes that showed signs of either fast or slow regulatory evolution. However, histone tags showed different regulatory patterns and formed two distinct clusters: promoter/active chromatin tags (H3K4me3, H3K9ac, H3K27ac) vs. heterochromatin tags (H3K27me3, H3K9me3). Conclusion: In humans, transposon-linked histone marks evolved in a coordinated way depending on their functional roles.
Background: Retroelements (REs) are transposable elements occupying ~40% of the human genome that can regulate genes by providing transcription factor binding sites (TFBS). RE-linked TFBS profile can serve as a marker of gene transcriptional regulation evolution. This approach allows for interrogating the regulatory evolution of organisms with RE-rich genomes. We aimed to characterize the evolution of transcriptional regulation for human genes and molecular pathways using RE-linked TFBS accumulation as a metric. Methods: We characterized human genes and molecular pathways either enriched or deficient in RE-linked TFBS regulation. We used ENCODE database with mapped TFBS for 563 transcription factors in 13 human cell lines. For 24,389 genes and 3124 molecular pathways, we calculated the score of RE-linked TFBS regulation reflecting the regulatory evolution rate at the level of individual genes and molecular pathways. Results: The major groups enriched by RE regulation deal with gene regulation by microRNAs, olfaction, color vision, fertilization, cellular immune response, and amino acids and fatty acids metabolism and detoxication. The deficient groups were involved in translation, RNA transcription and processing, chromatin organization, and molecular signaling. Conclusion: We identified genes and molecular processes that have characteristics of especially high or low evolutionary rates at the level of RE-linked TFBS regulation in human lineage.