Gene expression levels are dynamic molecular phenotypes that respond to biological, environmental, and technical perturbations. Here we use a novel replicate-classifier approach for discovering transcriptional signatures and apply it to the Genotype-Tissue Expression data set. We identified many factors contributing to expression heterogeneity, such as collection center and ischemia time, and our approach of scoring replicate classifiers allows us to statistically stratify these factors by effect strength. Strikingly, from transcriptional expression in blood alone we detect markers that help predict heart disease and stroke in some patients. Our results illustrate the challenges and opportunities of interpreting patterns of transcriptional variation in large-scale data sets.

KEYWORDS GTEx Consortium; gene expression normalization; Random Forest classification; transcriptional heterogeneity

Unlike previous large-scale tissue- (FANTOM Consortium et al. 2015) or cell type- (ENCODE Project Consortium 2012) specific expression data sets, the Genotype-Tissue Expression (GTEx) project (GTEx Consortium 2015) is unique in the breadth of tissue types sampled from the same individuals. The GTEx Consortium has previously demonstrated that tissue-specific gene expression signatures are preserved in postmortem samples using hierarchical clustering (Melé et al. 2015), which groups samples by gene expression using a data-driven approach to identify hidden structure in the data. While hierarchical clustering is effective at identifying the greatest global source of variation, it does not capture more subtle sources of variation.
For example, in the context of the GTEx project, hierarchical clustering largely captures gene expression variation due to tissue type, but less effectively captures the influence of confounding factors such as age or sex.

Using the GTEx pilot data freeze version 4, we attempted to recapitulate the results of hierarchical clustering using supervised Random Forest (RF) classification (Breiman 2001). Unlike hierarchical clustering, RF uses sample type annotations in a training data set to create decision trees, where the nodes correspond to genes whose expression levels distinguish between tissue types. Although RF classification typically considers a single classifier per classification task, we randomly generated replicate classifiers to statistically assess how well two groups can be distinguished. This approach is markedly distinct from hierarchical clustering or principal component analysis and enables statistical uncertainty to be rigorously quantified. These analyses reveal strong transcriptional signatures that contribute to patterns of expression heterogeneity in the GTEx data. More broadly, our results highlight that a deeper understanding of the determinants of transcriptional variation enables insights into the biological factors that govern variation in gene expression among tissues and individuals.
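The replicate-classifier idea described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: a toy ensemble of bagged one-split trees stands in for a full Random Forest so that the example needs only the Python standard library, the expression matrix is simulated, and the choices of 20 replicates, a 70/30 train/test split, and a label-permutation null are all assumptions made for illustration. All function names (`train_stumps`, `predict`, `replicate_accuracies`, `simulate`) are hypothetical.

```python
import random
import statistics


def train_stumps(X, y, n_trees, rng):
    """Toy stand-in for a Random Forest: bagged one-split trees.

    Each tree sees a bootstrap sample and one randomly chosen gene, and
    splits at the midpoint between the two group means for that gene.
    """
    n_samples, n_genes = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n_samples) for _ in range(n_samples)]
        gene = rng.randrange(n_genes)
        group0 = [X[i][gene] for i in boot if y[i] == 0]
        group1 = [X[i][gene] for i in boot if y[i] == 1]
        if not group0 or not group1:
            continue  # bootstrap drew a single group; skip this tree
        m0, m1 = statistics.fmean(group0), statistics.fmean(group1)
        # Store: gene index, threshold, and the group predicted above it.
        forest.append((gene, (m0 + m1) / 2, 1 if m1 > m0 else 0))
    return forest


def predict(forest, x):
    """Majority vote over all stumps; ties go to group 0."""
    votes = sum(high if x[g] > thr else 1 - high for g, thr, high in forest)
    return 1 if 2 * votes > len(forest) else 0


def replicate_accuracies(X, y, n_replicates=20, n_trees=50, seed=0,
                         permute=False):
    """Return the held-out accuracy of each replicate classifier.

    Every replicate uses its own random split and its own random trees.
    With permute=True the labels are shuffled in each replicate, which
    gives a null distribution to compare against (an assumed scoring
    scheme, not necessarily the one used in the paper).
    """
    accuracies = []
    for r in range(n_replicates):
        rng = random.Random(seed + r)
        labels = list(y)
        if permute:
            rng.shuffle(labels)
        order = list(range(len(X)))
        rng.shuffle(order)
        cut = int(0.7 * len(order))
        train, test = order[:cut], order[cut:]
        forest = train_stumps([X[i] for i in train],
                              [labels[i] for i in train], n_trees, rng)
        correct = sum(predict(forest, X[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return accuracies


def simulate(n_per_group=30, n_genes=20, n_shifted=10, shift=2.0, seed=1):
    """Simulated expression: the first n_shifted genes differ by group."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_group):
            X.append([rng.gauss(shift * label if g < n_shifted else 0.0, 1.0)
                      for g in range(n_genes)])
            y.append(label)
    return X, y


X, y = simulate()
real = replicate_accuracies(X, y)
null = replicate_accuracies(X, y, permute=True)
print(f"real labels:     mean accuracy {statistics.fmean(real):.2f}")
print(f"permuted labels: mean accuracy {statistics.fmean(null):.2f}")
```

Because each replicate yields its own accuracy, the spread across replicates gives a direct handle on uncertainty, and the gap between the real and permuted distributions indicates how strongly a factor separates two groups. In practice a full Random Forest implementation would replace the toy ensemble.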
Materials and Methods
Normalization and data curating
We first removed samples of non-Europea...