Short tandem repeats (STRs) have been implicated in a variety of complex traits in humans. However, genome-wide studies of the effects of STRs on gene expression have so far had limited power to detect associations and elucidate the underlying biological mechanisms. Here, we leverage whole genome sequencing and expression data for 17 tissues from GTEx to identify STRs whose repeat lengths are associated with expression of nearby genes (eSTRs). Our analysis reveals more than 3,000 high-confidence eSTRs, which are enriched in known or predicted regulatory regions. We show that eSTRs may act through a variety of mechanisms. We further identify hundreds of eSTRs that potentially drive published GWAS signals and implicate specific eSTRs in height and schizophrenia. In particular, we link an eSTR for RFT1 to height and use reporter assays to experimentally validate the effect of this STR on expression. Overall, our results indicate that eSTRs potentially contribute to a range of human phenotypes. The complete eSTR catalog is publicly available, and we expect it to serve as a valuable resource for future studies of complex traits.
Results
Profiling expression STRs across 17 human tissues
We performed a genome-wide analysis to identify associations between the number of repeats in each STR and expression of nearby genes (expression STRs, or "eSTRs", which we use to refer to a unique STR-by-gene association). We focused on 652 samples included in the Genotype-Tissue Expression (GTEx) (GTEx Consortium, 2015) dataset for which both high-coverage whole genome sequencing (WGS) and RNA-sequencing of multiple tissues were available. The WGS cohort consisted of 561 individuals with reported European ancestry, 75 of African ancestry, and 8, 3, and 5 of Asian, Amerindian, and Unknown ancestry, respectively. We used HipSTR (Willems et al., 2017) to genotype STRs in each sample. Resulting genotypes were subjected to stringent filtering to remove low-quality calls (Methods). After filtering, 175,226 STRs remained for downstream analysis. To identify eSTRs, for each STR within 100 kb of a gene we performed a linear regression between each individual's average STR length and normalized gene expression, controlling for sex, population structure, and technical covariates (Methods, Figures S1, S2). Analysis was restricted to 17 tissues where we had data for at least 100 samples (Figure 1A, Table S1, Methods) and to genes with median RPKM greater than 0. As a control, for each STR-gene pair we performed a permutation analysis in which sample identifiers were shuffled. Altogether, we performed an average of 278,521 STR-gene tests across 16,065 genes per tissue.
Using this approach, we identified 25,561 unique eSTRs associated with 11,810 genes in at least one tissue at a gene-level FDR of 10% (Methods). Of these, 8,417 (32.5%) were shared by two or more tissues and 469 were shared by 10 or more tissues (Figure S3). P-value...
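
The per-pair association test described in this section can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are simulated, the function names `estr_test` and `permuted_control` are hypothetical, the "average STR length" is assumed to be the mean of an individual's two allele repeat counts, and three random columns stand in for sex, population structure, and technical covariates. Both variables are z-scored before an ordinary least squares fit, which is also an assumption.

```python
import numpy as np

def zscore(v):
    """Center and scale a vector to mean 0 and unit variance."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def estr_test(str_len, expr, covariates):
    """OLS fit of expression on STR length plus covariates.

    Returns (beta, t_stat) for the STR length term.
    """
    x = zscore(str_len)
    y = zscore(expr)
    n = len(y)
    # Design matrix columns: intercept, STR length, then the covariates.
    X = np.column_stack([np.ones(n), x, covariates])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof
    # Standard OLS coefficient covariance: sigma^2 * (X'X)^-1.
    coef_cov = sigma2 * np.linalg.inv(X.T @ X)
    beta = coef[1]
    return beta, beta / np.sqrt(coef_cov[1, 1])

def permuted_control(str_len, expr, covariates, rng):
    """Repeat the test after shuffling which sample each genotype belongs to."""
    shuffled = rng.permutation(str_len)
    return estr_test(shuffled, expr, covariates)

# Simulated data for one hypothetical STR-gene pair.
rng = np.random.default_rng(0)
n = 200
covariates = rng.normal(size=(n, 3))           # stand-ins for real covariates
alleles = rng.integers(8, 20, size=(n, 2))     # repeat counts of two alleles
str_len = alleles.mean(axis=1)                 # assumed "average STR length"
expr = (0.5 * zscore(str_len)
        + covariates @ np.array([0.3, -0.2, 0.1])
        + rng.normal(size=n))

beta, t_stat = estr_test(str_len, expr, covariates)
perm_beta, perm_t = permuted_control(str_len, expr, covariates, rng)
print(f"observed: beta={beta:.3f}, t={t_stat:.2f}")
print(f"permuted: beta={perm_beta:.3f}, t={perm_t:.2f}")
```

In a real analysis this test would be run for every STR within 100 kb of each expressed gene in each tissue, and the permuted statistics would serve as an empirical null against which the observed associations are compared.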