Abstract: Independent Component Analysis (ICA) is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, ICA effectively reveals the source signals of the transcriptome as groups of co-regulated genes and their corresponding activities across diverse growth conditions. Two major variables that affect the output of ICA are the diversity and scope of the underlying data, and the user-defined …
“…Five additional iModulons were dominated by a single, high-coefficient gene, and are automatically identified by the method find_single_gene_imodulons. These Single Gene (SG) iModulons may arise from over-decomposition of the dataset [30,37] or artificial knock-out or overexpression of single genes. Together, these iModulons contribute to 1% of the variance in the dataset.…”
Section: Results
Type: mentioning
confidence: 99%
“…To compute the optimal independent components, an extension of ICA was performed on the RNA-seq dataset as described in McConn et al. [37].…”
We are firmly in the era of biological big data. Millions of omics datasets are publicly accessible and can be employed to support scientific research or build a holistic view of an organism. Here, we introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism's transcriptional regulatory network. This five-step process walks researchers through the mining, processing, curation, analysis, and characterization of all available expression data, using Bacillus subtilis as an example. The resulting reconstruction of the B. subtilis regulatory network can be leveraged to predict new regulons and analyze datasets in the context of all published data. The results are hosted at https://imodulondb.org/, and additional analyses can be performed using the PyModulon Python package. As the number of publicly available datasets increases, this pipeline will be applicable to a wide range of microbial pathogens and cell factories.
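The Single Gene iModulons mentioned in the citation statement above are components dominated by one high-coefficient gene. The sketch below illustrates that idea with a simple ratio heuristic. It is not the implementation of PyModulon's find_single_gene_imodulons: the function name find_single_gene_components, the ratio parameter, and the toy weights are all hypothetical and chosen only for illustration.

```python
def find_single_gene_components(weights, ratio=3.0):
    """Flag components whose largest absolute gene weight dominates the rest.

    weights: dict mapping component name -> dict of gene name -> weight.
    ratio:   hypothetical cutoff; a component is flagged when its top
             absolute weight is at least `ratio` times the runner-up.
    """
    flagged = []
    for name, gene_weights in weights.items():
        # Rank genes by absolute weight, largest first.
        ranked = sorted((abs(w) for w in gene_weights.values()), reverse=True)
        if not ranked or ranked[0] == 0:
            continue  # empty or all-zero component: nothing dominates
        if len(ranked) == 1 or ranked[0] >= ratio * ranked[1]:
            flagged.append(name)
    return flagged


# Toy gene weights (made up): comp_a is dominated by geneA, comp_b is not.
weights = {
    "comp_a": {"geneA": 0.40, "geneB": 0.02, "geneC": -0.01},
    "comp_b": {"geneA": 0.10, "geneB": 0.09, "geneC": -0.08},
}
single_gene = find_single_gene_components(weights)
print(single_gene)
```

A ratio test is only one possible criterion. A threshold-based rule, where a component counts as single-gene when only one gene passes the component's membership cutoff, would be an equally reasonable sketch.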
“…The final dataset was composed of 657 samples, spanning various conditions that describe M. tuberculosis's response to various nutrient sources, stressors, antibiotics, and virulence events. After the final dataset was obtained, a previously developed ICA algorithm was used to decompose the data into 80 robust iModulons [10] (Figure 1b).…”
Section: Independent Component Analysis Of Publicly Available Data Reveals 80 Transcriptional Modules For M Tuberculosis
Type: mentioning
confidence: 99%
“…'Uncharacterized' iModulons are those which had little overlap with known TFs or knowledge types, but still contained a significant number of genes. Finally, 'Single Gene' iModulons are those that track the expression of a single gene, and are treated as an artifact of the ICA decomposition [10].…”
Section: Independent Component Analysis Of Publicly Available Data Reveals 80 Transcriptional Modules For M Tuberculosis
Type: mentioning
confidence: 99%
“…To compute the optimal independent components, an extension of ICA was performed on the RNA-seq dataset as described in McConn et al. [10].…”
Mycobacterium tuberculosis is one of the most consequential human bacterial pathogens, posing a serious challenge to 21st century medicine. A key feature of its pathogenicity is its ability to adapt its transcriptional response to environmental stresses through its transcriptional regulatory network (TRN). While many studies have sought to characterize specific portions of the M. tuberculosis TRN, a systems level characterization and analysis of interactions among the controlling transcription factors remains to be achieved. Here, we applied an unsupervised machine learning method to modularize the M. tuberculosis transcriptome and describe the role of transcription factors (TFs) in the TRN. By applying Independent Component Analysis (ICA) to over 650 transcriptomic samples, we obtained 80 independently modulated gene sets known as "iModulons", many of which correspond to known regulons. These iModulons explain 61% of the variance in the organism's transcriptional response. We show that iModulons: 1) reveal the function of previously unknown regulons, 2) describe the transcriptional shifts that occur during environmental changes such as shifting carbon sources, oxidative stress, and virulence events, and 3) identify intrinsic clusters of transcriptional regulons that link several important metabolic systems, including lipid, cholesterol, and sulfur metabolism. This transcriptome-wide analysis of the M. tuberculosis TRN informs future research on effective ways to study and manipulate its transcriptional regulation, and presents a knowledge-enhanced database of all published high-quality RNA-seq data for this organism to date.
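The abstract above reports that the iModulons explain 61% of the variance. One common way to compute a figure of this kind is to compare the expression matrix X with its reconstruction from the ICA factors, X ≈ M · A, where M holds gene weights and A holds component activities. The sketch below is a minimal, dependency-free illustration under stated assumptions: X is taken to be already centered per gene, the function names are hypothetical, and the numbers are toy values. It is not necessarily the procedure used in the paper.

```python
def matmul(M, A):
    """Multiply a (genes x k) matrix by a (k x samples) matrix (nested lists)."""
    return [[sum(m * a for m, a in zip(row, col)) for col in zip(*A)]
            for row in M]


def explained_variance(X, M, A):
    """Fraction of the sum of squares in X captured by the product M @ A.

    Assumes X (genes x samples) is already centered per gene, so the
    total sum of squares is proportional to the total variance.
    """
    recon = matmul(M, A)
    ss_res = sum((x - r) ** 2
                 for x_row, r_row in zip(X, recon)
                 for x, r in zip(x_row, r_row))
    ss_tot = sum(x ** 2 for row in X for x in row)
    return 1.0 - ss_res / ss_tot


# Toy data: 3 genes, 2 samples, 1 component (all values made up).
M = [[1.0], [2.0], [0.0]]      # gene weights
A = [[1.0, -1.0]]              # component activity per sample
X = [[1.0, -1.0],
     [2.0, -2.0],
     [1.0, -1.0]]              # third gene is not captured by the component

fraction = explained_variance(X, M, A)
print(round(fraction, 3))
```

In this toy case the component reproduces the first two genes exactly and misses the third, so the explained fraction falls below 1.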
Uncovering the structure of the transcriptional regulatory network (TRN) that modulates gene expression in prokaryotes remains an important challenge. Transcriptomics data is plentiful, necessitating the development of scalable methods for converting this data into useful knowledge about the TRN. Previously, we published the PRECISE dataset for Escherichia coli K-12 MG1655, containing 278 RNA-seq datasets created using a standardized protocol. Here, we present PRECISE 2.0, which is nearly three times the size of the original PRECISE dataset and also created using a standardized protocol. We analyze PRECISE 2.0 at multiple scales, demonstrating multiple analytical strategies for extracting knowledge from this dataset. Specifically, we: (1) highlight patterns in gene expression across the dataset; (2) utilize independent component analysis to extract 218 independently modulated groups of genes (iModulons) that describe the TRN at the systems level; (3) demonstrate the utility of iModulons over traditional differential expression analysis; and (4) uncover 6 new potential regulons. Thus, PRECISE 2.0 is a large-scale, high-quality transcriptomics dataset which may be analyzed at multiple scales to yield important biological insights.