Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, which yields insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences are not diminished. The designed sequences have greater overall diversity than the corresponding natural sequence alignments, and no direct correlation is seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold, together with the separation of sequence entropy distributions for different folds, suggests that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This has important implications for experimental protein design and engineering, and it provides insight into protein evolution.

Keywords: Protein design; sequence space; designability; backbone flexibility; distributed computing

The aim of protein design is to find amino acid sequences that are compatible with specific protein structures.
Screening of sequences for compatibility with a protein structure was introduced in the early 1980s, with the definition of the inverse folding problem (Pabo 1983). Whereas protein folding involves finding the native three-dimensional structure for a particular amino acid sequence, the inverse folding problem seeks to define the entire set of sequences that can specifically form a stable protein with some target structure. Protein design, whether experimental, computational, or some hybrid approach, provides important clues towards a solution of the inverse protein folding problem by sampling the sequence space of known protein structures (Pande et al. 1997).

An important practical use of protein design is in the stabilization of known protein folds (Dahiyat 1999). The optimization schemes used in most protein design algorithms are written to find locally or globally optimized sequences, with the lowest or near-lowest free energy of folding for an existing target structure; much recent work has addressed this topic (Desjarlais and Clarke 1998; Shakh

Reprint requests to: Vijay S. Pande, Ch...