BackgroundIt has been long well known that genes do not act alone; rather groups of genes act in consort during a biological process. Consequently, the expression levels of genes are dependent on each other. Experimental techniques to detect such interacting pairs of genes have been in place for quite some time. With the advent of microarray technology, newer computational techniques to detect such interaction or association between gene expressions are being proposed which lead to an association network. While most microarray analyses look for genes that are differentially expressed, it is of potentially greater significance to identify how entire association network structures change between two or more biological settings, say normal versus diseased cell types.ResultsWe provide a recipe for conducting a differential analysis of networks constructed from microarray data under two experimental settings. At the core of our approach lies a connectivity score that represents the strength of genetic association or interaction between two genes. We use this score to propose formal statistical tests for each of following queries: (i) whether the overall modular structures of the two networks are different, (ii) whether the connectivity of a particular set of "interesting genes" has changed between the two networks, and (iii) whether the connectivity of a given single gene has changed between the two networks. A number of examples of this score is provided. We carried out our method on two types of simulated data: Gaussian networks and networks based on differential equations. We show that, for appropriate choices of the connectivity scores and tuning parameters, our method works well on simulated data. We also analyze a real data set involving normal versus heavy mice and identify an interesting set of genes that may play key roles in obesity.ConclusionsExamining changes in network structure can provide valuable information about the underlying biochemical pathways. Differential network analysis with appropriate connectivity scores is a useful tool in exploring changes in network structures under different biological conditions. An R package of our tests can be downloaded from the supplementary website http://www.somnathdatta.org/Supp/DNA.
Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal method due to multiple factors contributing to read count variability that affects the overall sensitivity and specificity. In order to properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines based on different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, the detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, the results from an analysis based on the qualitative characteristics of sample distribution for MAQC2 and human breast cancer datasets show that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3 with less variation in five replicates, all methods performed similarly. Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene analysis of RNA-seq data skewed towards lowly expressed read counts with high variation by improving specificity while maintaining a good detection power with a control of the nominal FDR level.
Next generation sequencing has revolutionized the status of biological research. For a long time, the gold standard of DNA sequencing was considered to be the Sanger method. However, in 2005, commercial launching of next generation sequencing has made it possible to generate massively parallel and high resolution DNA sequence data. Its usefulness in various genomic applications such as genome-wide detection of SNPs, DNA methylation profiling, mRNA expression profiling, whole-genome re-sequencing and so on are now well recognized. There are several platforms for generating next generation sequencing (NGS) data which we briefly discuss in this mini overview. With new technologies come new challenges for the data analysts. This mini review attempts to present a collection of selected topics in the current development of statistical methods dealing with these novel data types. We believe that knowing the advances and bottlenecks of this technology will help the researchers to benchmark the analytical tools dealing with these data and will pave the path for its proper application into clinical diagnostics.
BackgroundChanges in microRNA (miRNA) expression patterns have been extensively characterized in several cancers, including human colon cancer. However, how these miRNAs and their putative mRNA targets contribute to the etiology of cancer is poorly understood. In this work, a bioinformatics computational approach with miRNA and mRNA expression data was used to identify the putative targets of miRNAs and to construct association networks between miRNAs and mRNAs to gain some insights into the underlined molecular mechanisms of human colon cancer.MethodThe miRNA and mRNA microarray expression profiles from the same tissues including 7 human colon tumor tissues and 4 normal tissues, collected by the Broad Institute, were used to identify significant associations between miRNA and mRNA. We applied the partial least square (PLS) regression method and bootstrap based statistical tests to the joint expression profiles of differentially expressed miRNAs and mRNAs. From this analysis, we predicted putative miRNA targets and association networks between miRNAs and mRNAs. Pathway analysis was employed to identify biological processes related to these miRNAs and their associated predicted mRNA targets.ResultsMost significantly associated up-regulated mRNAs with a down-regulated miRNA identified by the proposed methodology were considered to be the miRNA targets. On average, approximately 16.5% and 11.0% of targets predicted by this approach were also predicted as targets by the common prediction algorithms TargetScan and miRanda, respectively. We demonstrated that our method detects more targets than a simple correlation based association. Integrative mRNA:miRNA predictive networks from our analysis were constructed with the aid of Cytoscape software. Pathway analysis validated the miRNAs through their predicted targets that may be involved in cancer-associated biological networks.ConclusionWe have identified an alternative bioinformatics approach for predicting miRNA targets in human colon cancer and for reverse engineering the miRNA:mRNA network using inversely related mRNA and miRNA joint expression profiles. We demonstrated the superiority of our predictive method compared to the correlation based target prediction algorithm through a simulation study. We anticipate that the unique miRNA targets predicted by the proposed method will advance the understanding of the molecular mechanism of colon cancer and will suggest novel therapeutic targets after further experimental validations.
The high throughput RNA sequencing (RNA-seq) technology has become the popular method of choice for transcriptomics and the detection of differentially expressed genes. Sample size calculations for RNA-seq experimental design are an important consideration in biological research and clinical trials. Currently, the sample size formulas derived from the Wald and the likelihood ratio statistical tests with a Poisson distribution to model RNA-seq data have been developed. However, since the mean read counts in the real RNA-seq data are not equal to the variance, an extended method to calculate sample sizes based on a negative binomial distribution using an exact test statistic was proposed by . In this study, we alternatively derive five sample size calculation methods based on the negative binomial distribution using the Wald test, the log-transformed Wald test and the log-likelihood ratio test statistics. A comparison of our five methods and an existing method was performed by calculating the sample sizes and the simulated power in different scenarios. We first calculated the sample sizes for testing a single gene using the six methods given a nominal significance level α at 0.05 and 80% power. Then, we calculated the sample sizes for testing multiple genes given a false discovery rate (FDR) at 0.05 and 0.10. The empirical power and true prognostic genes for differential gene expression analysis corresponding to the estimated sample sizes from the six methods are also estimated via the simulation studies. Using the sample size formulas derived from log-transformed and Wald-based tests, we observed smaller sample properties while maintaining the nominal power close to or higher than 80% in all the settings compared to other methods. Moreover, the Wald test based sample size calculation method is easier to compute and faster in an RNA-seq experimental design. Later, several sample size calculation methods that were derived from the score statistic and the log-likelihood ratio test (LRT) statistic using the Poisson distribution were proposed [16]. However, the assumption of a Poisson distribution that the expected mean and variance are equal usually does not hold for RNA-seq studies, where the variance is typically greater than the mean of the read counts [17]. Therefore, a negative binomial distribution with a dispersion parameter is used to model RNA-seq data by the existing software packages such as DESeq [17] and edgeR [18], in which an exact test is used to test DEGs between conditions. Subsequently, a sample size calculation method based on an exact test statistic with the aid of the edgeR package [18] was proposed [19]. However, sample size methods derived from other test statistics such as the Wald test, the LRT and an extension of Wald test via log-transformation using negative binomial distribution to model the RNA-seq data have not yet been explored.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.