Alterations in DNA methylation (DNAm) in cancer have been known for 25 years, including hypomethylation of oncogenes and hypermethylation of tumor suppressor genes1. However, most studies of cancer methylation have assumed that functionally important DNAm will occur in promoters, and that most DNAm changes in cancer occur in CpG islands2,3. Here we show that most methylation alterations in colon cancer occur neither in promoters nor in CpG islands, but in sequences up to 2 kb distant, which we term “CpG island shores.” CpG island shore methylation was strongly related to gene expression, and it was highly conserved in mouse, discriminating tissue types regardless of species of origin. There was a surprising overlap (45-65%) of the location of colon cancer-related methylation changes with those that distinguished normal tissues, with hypermethylation enriched closer to the associated CpG islands, and hypomethylation enriched further from the associated CpG island and resembling non-colon normal tissues. Thus, methylation changes in cancer are at sites that vary normally in tissue differentiation, and they are consistent with the epigenetic progenitor model of cancer4, that epigenetic alterations affecting tissue-specific differentiation are the predominant mechanism by which epigenetic changes cause cancer.
High density oligonucleotide expression arrays are widely used in many areas of biomedical research. Affymetrix GeneChip arrays are the most popular. In the Affymetrix system, a fair amount of further pre-processing and data reduction occurs following the image processing. These arrays use short oligonucleotides to probe for genes in an RNA sample. Typically each gene will be represented by 11-20 pairs of oligonucleotide probes. The first component of these pairs is referred to as a perfect match probe and is designed to hybridize only with transcripts from the intended gene (specific hybridization). However, hybridization by other sequences (non-specific hybridization) is unavoidable. Furthermore, hybridization strengths are measured by a scanner that introduces optical noise. Therefore, the observed intensities need to be adjusted to give accurate measurements of specific hybridization. We have found that the default ad hoc adjustment, provided as part of the Affymetrix system, can be improved via the use of estimators derived from a statistical model that uses probe sequence information. A final step in pre-processing is to summarize the probe-level data for each gene to define a measure of expression that represents the amount of the corresponding mRNA species. In this paper we illustrate the practical consequences of not adjusting appropriately for the presence of non-specific hybridization and provide a solution based on our background adjustment procedure. Software that computes our adjustment is available as part of the Bioconductor project.
* Zhijin Wu is a graduate student and Rafael A. Irizarry is Associate Professor of Biostatistics.
The ability to measure gene expression on a genome-wide scale is one of the most promising accomplishments in molecular biology. Microarrays, the technology that first permitted this, were riddled with problems due to unwanted sources of variability. Many of these problems are now mitigated, after a decade's worth of statistical methodology development. The recently developed RNA sequencing (RNA-seq) technology has generated much excitement in part due to claims of reduced variability in comparison to microarrays. However, we show that RNA-seq data demonstrate unwanted and obscuring variability similar to what was first observed in microarrays. In particular, we find guanine-cytosine content (GC-content) has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. We also report on commonly observed data distortions that demonstrate the need for data normalization. Here, we describe a statistical methodology that improves precision by 42% without loss of accuracy. Our resulting conditional quantile normalization algorithm combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.
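The quantile normalization step mentioned in this abstract can be illustrated with a minimal sketch. It covers only plain quantile normalization across samples, not the conditional quantile normalization algorithm itself: the regression on GC-content is omitted, ties are broken by position rather than averaged, and the function name and toy values are hypothetical.

```python
def quantile_normalize(columns):
    """Force every sample (column) to share the same empirical distribution.

    `columns` is a list of equal-length lists, one list per sample.
    Hypothetical helper for illustration; ties are broken by position.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of genes")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n)
    ]

    normalized = []
    for col in columns:
        # Indices of this sample's genes, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            # Each gene receives the reference value for its rank.
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples measured on four genes (made-up values).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After normalization every sample contains the same set of values, and only the assignment of those values to genes differs, following each gene's rank within its sample.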
Phosphorus (P) is a critical driver of phytoplankton growth and ecosystem function in the ocean. Diatoms are an abundant class of marine phytoplankton that are responsible for significant amounts of primary production. With the control they exert on the oceanic carbon cycle, there have been a number of studies focused on how diatoms respond to limiting macro and micronutrients such as iron and nitrogen. However, diatom physiological responses to P deficiency are poorly understood. Here, we couple deep sequencing of transcript tags and quantitative proteomics to analyze the diatom Thalassiosira pseudonana grown under P-replete and P-deficient conditions. A total of 318 transcripts were differentially regulated with a false discovery rate of < 0.05, and a total of 136 proteins were differentially abundant (p < 0.05). Significant changes in the abundance of transcripts and proteins were observed and coordinated for multiple biochemical pathways, including glycolysis and translation. Patterns in transcript and protein abundance were also linked to physiological changes in cellular P distributions and enzyme activities. These data demonstrate that diatom P deficiency results in changes in cellular P allocation through polyphosphate production, increased P transport, a switch to utilization of dissolved organic P through increased production of metalloenzymes, and a remodeling of the cell surface through production of sulfolipids. Together, these findings reveal that T. pseudonana has evolved a sophisticated response to P deficiency involving multiple biochemical strategies that are likely critical to its ability to respond to variations in environmental P availability.
Supplementary data are available at Bioinformatics online.
We have developed a graphical tool to evaluate summaries of Affymetrix probe level data. Plots and summary statistics offer a picture of how an expression measure performs in several important areas. This picture facilitates the comparison of competing expression measures and the selection of methods suitable for a specific investigation. The key is a benchmark data set consisting of a dilution study and a spike-in study. Because the truth is known for these data, we can identify statistical features of the data for which the expected outcome is known in advance. Those features highlighted in our suite of graphs are justified by questions of biological interest and motivated by the presence of appropriate data.
Recent developments in RNA-sequencing (RNA-seq) technology have led to a rapid increase in gene expression data in the form of counts. RNA-seq can be used for a variety of applications; however, identifying differential expression (DE) remains a key task in functional genomics. A number of statistical methods have been proposed for DE detection in RNA-seq data. One common feature of several leading methods is the use of the negative binomial (Gamma–Poisson mixture) model. That is, the unobserved gene expression is modeled by a gamma random variable and, given the expression, the sequencing read counts are modeled as Poisson. The distinguishing feature of the various methods is how the variance, or dispersion, of the gamma distribution is modeled and estimated. We evaluate several large public RNA-seq datasets and find that the estimated dispersion in existing methods does not adequately capture the heterogeneity of biological variance among samples. We present a new empirical Bayes shrinkage estimate of the dispersion parameters and demonstrate improved DE detection.
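The Gamma–Poisson mixture described in this abstract can be made concrete with a small simulation. This is a sketch under stated assumptions, not the paper's estimator: it does not implement the empirical Bayes shrinkage, the parameterization (mean `mu`, dispersion `phi`, so that the marginal variance is `mu + phi * mu**2`) is one common convention, and all function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method.

    Adequate for the modest rates used here; exp(-lam) underflows for
    very large rates.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gamma_poisson(mu, phi, rng):
    """Draw one read count from the Gamma-Poisson (negative binomial) mixture.

    The latent expression is gamma with mean mu and variance phi * mu**2;
    given that expression, the count is Poisson.
    """
    shape = 1.0 / phi   # gamma shape
    scale = mu * phi    # gamma scale, so the mean is shape * scale = mu
    latent_expression = rng.gammavariate(shape, scale)
    return sample_poisson(latent_expression, rng)


def nb_variance(mu, phi):
    """Marginal variance of the mixture: Poisson noise plus gamma variance."""
    return mu + phi * mu ** 2


rng = random.Random(0)  # fixed seed so the run is reproducible
mu, phi = 10.0, 0.5     # assumed mean and dispersion for one gene
counts = [sample_gamma_poisson(mu, phi, rng) for _ in range(20000)]
sample_mean = statistics.fmean(counts)
sample_var = statistics.variance(counts)
```

With these assumed values the theoretical variance is 60, six times the Poisson variance of 10. That overdispersion is what the dispersion parameter captures, and it is why its estimation matters for DE detection.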