Principal components analysis has been employed in gene expression studies to correct for population substructure and batch and environmental effects. This method typically involves the removal of variation contained in as many as 50 principal components (PCs), which can constitute a large proportion of total variation present in the data. Each PC, however, can detect many sources of variation, including gene expression networks and genetic variation influencing transcript levels. We demonstrate that PCs generated from gene expression data can simultaneously contain both genetic and nongenetic factors. From heritability estimates we show that all PCs contain a considerable portion of genetic variation while nongenetic artifacts such as batch effects were associated to varying degrees with the first 60 PCs. These PCs demonstrate an enrichment of biological pathways, including core immune function and metabolic pathways. The use of PC correction in two independent data sets resulted in a reduction in the number of cis- and trans-expression QTL detected. Comparisons of PC and linear model correction revealed that PC correction was not as efficient at removing known batch effects and had a higher penalty on genetic variation. Therefore, this study highlights the danger of eliminating biologically relevant data when employing PC correction in gene expression data.
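To make the operation under discussion concrete, the following is a minimal sketch of PC correction, assuming a samples-by-genes expression matrix and removal of the leading PCs by singular value decomposition. The function name `remove_top_pcs`, the simulated data, and the choice of a single removed PC are illustrative assumptions, not the procedure or data used in this study.

```python
import numpy as np


def remove_top_pcs(expr, n_pcs):
    """Remove the leading n_pcs principal components from expr.

    expr is a samples x genes array (hypothetical layout). Returns the
    corrected matrix and the fraction of total variance that was removed.
    """
    # Center each gene so that the SVD corresponds to a PCA.
    centered = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    # Proportion of total variance carried by each PC.
    var_explained = s ** 2 / np.sum(s ** 2)

    # Zero out the leading singular values and rebuild the matrix.
    s_kept = s.copy()
    s_kept[:n_pcs] = 0.0
    corrected = (u * s_kept) @ vt
    return corrected, var_explained[:n_pcs].sum()


# Simulated example (assumed values): 40 samples, 200 genes, two batches.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 200
batch = np.repeat([0.0, 1.0], n_samples // 2)
noise = rng.normal(size=(n_samples, n_genes))
expr = noise + 3.0 * np.outer(batch, rng.normal(size=n_genes))

corrected, frac_removed = remove_top_pcs(expr, n_pcs=1)
print(f"Fraction of variance removed: {frac_removed:.2f}")
```

The sketch removes everything that loads on the discarded PCs, whatever its origin. In real data that can include genetic and other biological signal alongside the batch effect, which is the concern examined here.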
Gene expression profiling has become a very popular technique used to quantify regulatory changes in messenger (m)RNA expression associated with disease and environmental factors. Gene expression acts as an intermediate phenotype between genotypes and complex traits and is known to act as a modifier to disease susceptibility (Nica and Dermitzakis 2008; Li et al. 2012). Genetic variation underlying gene expression levels has been well established and reported within the literature, with the transcript levels for the majority of genes being heritable to some degree (Price et al. 2011; Grundberg et al. 2012; Powell et al. 2012b). Microarray technology can simultaneously capture the expression of thousands of transcripts within an individual. However, these arrays are sensitive to environmental or experimental perturbations, for example due to different laboratory technicians and reagents (Churchill 2002; Irizarry et al. 2005), microarray chip and chip position (Luo et al. 2010), temperature (Scherer 2009), and even ozone levels (Thomas et al. 2003). These effects can constitute a substantial proportion of variance within a data set (Leek et al. 2010). Normalization strategies have become standard in gene expression studies to correct for nonnormal distributions and inconsistencies between arrays (Allison et al. 2006). However, normalization techniques do not control for batch effects caused by technical artifacts. These batch effects require additional correction techniques (Scherer 2009) and failure to do so has led to spurious associations (Spielman and Cheung 2007; Baggerly et al. 2008). Many different correction and normalization techniques are currently us...