Computational studies of the relationships between protein sequence, structure, and folding have traditionally relied on purely local sequence representations. Here we show that global representations, based on parameters that encode information about complete sequences, contain otherwise inaccessible information about the organization of sequences. By studying the spectral properties of these parameters, we demonstrate that amino acid physical properties fall into two distinct classes. One class comprises properties that favor sequentially localized interaction clusters. The other class comprises properties that favor globally distributed interactions. This observation provides a bridge between two classic models of protein folding (the collapse model and the nucleation model) and provides a basis for understanding how any degree of intermediacy between these two extremes can occur.

proteomics | sequence analysis

Bioinformatic studies of protein sequences have concentrated almost exclusively on their local properties. The relationship between local sequence properties and local folding has been extensively examined. Sequence homology studies have concentrated on developing methods for establishing local equivalences between corresponding residues in pairs of sequences. It has become increasingly clear, however, that a purely local view of protein sequences is not adequate. In a number of recent studies (1-6), we have demonstrated quantitatively that there are intrinsic limitations to the informatic power of local descriptions of protein sequence, particularly with respect to the encoding of structural information. Because sequence nevertheless completely determines protein structure, it follows that folding instructions must be encrypted in global, rather than local, sequence information.
In the present work, we discuss some fundamental global properties of protein sequences and examine their implications for mechanisms of protein folding.
Model

A necessary preliminary to any meaningful discussion of sequence characteristics is the conversion of protein sequences into a numerical form amenable to systematic analysis. We follow a procedure set forth in previous work (7-9), using the 10 Kidera property factors (10, 11), which form an orthonormal and essentially complete basis set for the known physical properties of the amino acids, to represent an amino acid as a 10-vector. (The Kidera factors are given in Table 1.) A complete protein sequence is then represented by a set of 10 N-member numerical strings, each of which records the course of one property factor along the N-residue sequence. These strings can be Fourier transformed, leading to a representation of the sequence by a set of sine and cosine Fourier coefficients. Each of these coefficients, which is labeled by a wave number k and a property identifier l, encodes information about the entire sequence of the protein. Furthermore, the Fourier components are determined by information associated with different int...
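The encoding described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation. The table `TOY_FACTORS` holds hypothetical placeholder values for four residues and two properties (it does not reproduce the Kidera factors of Table 1), the function names are invented for illustration, and the 1/N normalization of the sine and cosine coefficients is an assumed convention that the text does not specify.

```python
import math

# Hypothetical placeholder values, NOT the Kidera factors of Table 1.
# Each residue maps to a short property vector (2 properties here, 10 in the text).
TOY_FACTORS = {
    "A": (1.0, -0.5),
    "G": (0.0, 0.5),
    "L": (-1.0, 1.0),
    "K": (0.5, -1.0),
}


def property_strings(sequence, factors):
    """Return one numerical string per property, each of length N."""
    n_props = len(next(iter(factors.values())))
    return [[factors[res][l] for res in sequence] for l in range(n_props)]


def fourier_coefficients(values):
    """Cosine and sine coefficients of one property string.

    Assumed convention: a_k = (1/N) * sum_n x_n * cos(2*pi*k*n/N),
    and b_k likewise with sin, for k = 0 .. N//2.
    """
    n = len(values)
    cos_coeffs, sin_coeffs = [], []
    for k in range(n // 2 + 1):
        angle = 2.0 * math.pi * k / n
        cos_coeffs.append(sum(x * math.cos(angle * j) for j, x in enumerate(values)) / n)
        sin_coeffs.append(sum(x * math.sin(angle * j) for j, x in enumerate(values)) / n)
    return cos_coeffs, sin_coeffs


def encode(sequence, factors):
    """Map (wave number k, property index l) -> (cosine coeff, sine coeff)."""
    out = {}
    for l, string in enumerate(property_strings(sequence, factors)):
        cos_coeffs, sin_coeffs = fourier_coefficients(string)
        for k, pair in enumerate(zip(cos_coeffs, sin_coeffs)):
            out[(k, l)] = pair
    return out


coeffs = encode("AGLKAGLK", TOY_FACTORS)
```

Every sum in `fourier_coefficients` runs over all N residues, so each coefficient labeled by (k, l) depends on the whole sequence. This is the sense in which the representation is global rather than local.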