Gene expression is controlled primarily by transcription factors, whose DNA binding sites are typically 10 nt long. We develop a population-genetic model to understand how the length and information content of such binding sites evolve. Our analysis is based on an inherent trade-off between specificity, which is greater in long binding sites, and robustness to mutation, which is greater in short binding sites. The evolutionarily stable distribution of binding site lengths predicted by the model agrees with the empirical distribution (5-31 nt, with mean 9.9 nt for eukaryotes), and it is remarkably robust to variation in the underlying parameters of population size, mutation rate, number of transcription factor targets, and strength of selection for proper binding and selection against improper binding. In a systematic data set of eukaryotic and prokaryotic transcription factors, we also uncover strong relationships between the length of a binding site and its information content per nucleotide, as well as between the number of targets a transcription factor regulates and the information content in its binding sites. Our analysis explains these features as well as the remarkable conservation of binding site characteristics across diverse taxa.
Much of the phenotypic divergence between species is driven by changes in transcriptional regulation, and especially by point mutations at transcription factor binding sites (Stern 2000; Carroll 2005; Ihmels et al. 2005; Prud'homme et al. 2006, 2007; Tsong et al. 2006; Wray 2007; Lemos et al. 2008; Tuch et al. 2008a,b). Such mutations can increase or decrease the affinity of a transcription factor protein for its binding sites, which in turn modifies the expression of regulated genes. Binding sites are typically ~10 nt in length, in both eukaryotes and prokaryotes, although this number varies from as few as 5 to >30 nt (Figure 1). Binding sites are also characterized by their information content (D'haeseleer 2006), which is determined by the number of different bases that can occur at each nucleotide and still produce functional binding. The average information content varies from a maximum of 2 bits per nucleotide (each nucleotide must assume a specific base to produce functional binding) to <0.25 bits (each nucleotide can assume one of several bases and still produce functional binding).
What determines the length and information content of a transcription factor binding site? Biophysical factors provide some constraints, and numerous studies have explored their effects on the function of individual transcription factor binding sites (Berg et al. 2004; Bintu et al. 2005; Shultzaberger et al. 2007; Mustonen et al. 2008; Gerland and Hwa 2009). Natural selection also plays an important role because, to produce the correct patterns of gene expression, transcription factors must correctly bind to some sites in the genome and avoid binding elsewhere (Sengupta et al. 2002; Shultzaberger et al. 2012). If binding sites are too short, transcription factors bind too readily to...