Optimal dimensionality selection for independent component analysis of transcriptomic data

McConn, J. L.; Lamoureux, C. R.; Poudel, Saugat; Kim, Jaehyung; Sastry, Anand V.

doi:10.1101/2021.05.26.445885

Cited by 10 publications

(13 citation statements)

References 24 publications

“…Five additional iModulons were dominated by a single, high-coefficient gene, and are automatically identified by the method find_single_gene_imodulons. These Single Gene (SG) iModulons may arise from over-decomposition of the dataset 30,37 or artificial knock-out or overexpression of single genes. Together, these iModulons contribute to 1% of the variance in the dataset.…”

Section: Results (mentioning)

confidence: 99%

Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks

Sastry, Poudel, Rychel et al. 2021. Preprint (self-citation).

We are firmly in the era of biological big data. Millions of omics datasets are publicly accessible and can be employed to support scientific research or build a holistic view of an organism. Here, we introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism's transcriptional regulatory network. This five-step process walks researchers through the mining, processing, curation, analysis, and characterization of all available expression data, using Bacillus subtilis as an example. The resulting reconstruction of the B. subtilis regulatory network can be leveraged to predict new regulons and analyze datasets in the context of all published data. The results are hosted at https://imodulondb.org/, and additional analyses can be performed using the PyModulon Python package. As the number of publicly available datasets increases, this pipeline will be applicable to a wide range of microbial pathogens and cell factories.

Section: Results (mentioning)

confidence: 99%

“…To compute the optimal independent components, an extension of ICA was performed on the RNA-seq dataset as described in McConn et al. 37.…”

Section: Methods (mentioning)

confidence: 99%

Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks

Sastry, Poudel, Rychel et al. 2021. Preprint (self-citation).

“…The final dataset was composed of 657 samples, spanning various conditions that describe M. tuberculosis's response to various nutrient sources, stressors, antibiotics, and virulence events. After the final dataset was obtained, a previously developed ICA algorithm was used to decompose the data into 80 robust iModulons [10] (Figure 1b).…”

Section: Independent Component Analysis Of Publicly Available Data Reveals 80 Transcriptional Modules For M Tuberculosis (mentioning)

confidence: 99%

“…'Uncharacterized' iModulons are those which had little overlap with known TFs or knowledge types, but still contained a significant number of genes. Finally, 'Single Gene' iModulons are those that track the expression of a single gene, and are treated as an artifact of the ICA decomposition [10].…”

Section: Independent Component Analysis Of Publicly Available Data Reveals 80 Transcriptional Modules For M Tuberculosis (mentioning)

confidence: 99%

Machine learning of all Mycobacterium tuberculosis H37Rv RNA-seq data reveals a structured interplay between metabolism, stress response, and infection

Yoo, Rychel, Poudel et al. 2021. Preprint (self-citation).

Mycobacterium tuberculosis is one of the most consequential human bacterial pathogens, posing a serious challenge to 21st century medicine. A key feature of its pathogenicity is its ability to adapt its transcriptional response to environmental stresses through its transcriptional regulatory network (TRN). While many studies have sought to characterize specific portions of the M. tuberculosis TRN, a systems level characterization and analysis of interactions among the controlling transcription factors remains to be achieved. Here, we applied an unsupervised machine learning method to modularize the M. tuberculosis transcriptome and describe the role of transcription factors (TFs) in the TRN. By applying Independent Component Analysis (ICA) to over 650 transcriptomic samples, we obtained 80 independently modulated gene sets known as "iModulons", many of which correspond to known regulons. These iModulons explain 61% of the variance in the organism's transcriptional response. We show that iModulons: 1) reveal the function of previously unknown regulons, 2) describe the transcriptional shifts that occur during environmental changes such as shifting carbon sources, oxidative stress, and virulence events, and 3) identify intrinsic clusters of transcriptional regulons that link several important metabolic systems, including lipid, cholesterol, and sulfur metabolism. This transcriptome-wide analysis of the M. tuberculosis TRN informs future research on effective ways to study and manipulate its transcriptional regulation, and presents a knowledge-enhanced database of all published high-quality RNA-seq data for this organism to date.

A multi-scale transcriptional regulatory network knowledge base for Escherichia coli

Sastry et al. 2021. Preprint (self-citation).

Uncovering the structure of the transcriptional regulatory network (TRN) that modulates gene expression in prokaryotes remains an important challenge. Transcriptomics data is plentiful, necessitating the development of scalable methods for converting this data into useful knowledge about the TRN. Previously, we published the PRECISE dataset for Escherichia coli K-12 MG1655, containing 278 RNA-seq datasets created using a standardized protocol. Here, we present PRECISE 2.0, which is nearly three times the size of the original PRECISE dataset and also created using a standardized protocol. We analyze PRECISE 2.0 at multiple scales, demonstrating multiple analytical strategies for extracting knowledge from this dataset. Specifically, we: (1) highlight patterns in gene expression across the dataset; (2) utilize independent component analysis to extract 218 independently modulated groups of genes (iModulons) that describe the TRN at the systems level; (3) demonstrate the utility of iModulons over traditional differential expression analysis; and (4) uncover 6 new potential regulons. Thus, PRECISE 2.0 is a large-scale, high-quality transcriptomics dataset which may be analyzed at multiple scales to yield important biological insights.
