We have developed a new representation for structural and functional motifs in protein sequences based on correlations between pairs of amino acids and applied it to α-helical and β-sheet sequences. Existing probabilistic methods for representing and analyzing protein sequences have traditionally assumed conditional independence of evidence. In other words, amino acids are assumed to have no effect on each other. However, analyses of protein structures have repeatedly demonstrated the importance of interactions between amino acids in conferring both structure and function. Using Bayesian networks, we are able to model the relationships between amino acids at distinct positions in a protein sequence in addition to the amino acid distributions at each position. We have also developed an automated program for discovering sequence correlations using standard statistical tests and validation techniques. In this paper, we test this program on sequences from secondary structure motifs, namely α-helices and β-sheets. In each case, the correlations our program discovers correspond well with known physical and chemical interactions between amino acids in structures. Furthermore, we show that, using different chemical alphabets for the amino acids, we discover structural relationships based on the same chemical principle used in constructing the alphabet. This new representation of 3-dimensional features in protein motifs, such as those arising from structural or functional constraints on the sequence, can be used to improve sequence analysis tools including pattern analysis and database search.

Keywords: α-helix structure; amino acid correlations; motif modeling; sequence analysis; side-chain interactions; structure analysis

Understanding the 3-dimensional structure of a protein is a necessary and critical step toward understanding the protein's function.
For example, only after the structure of hemoglobin was solved was it possible to dissect the mechanisms responsible for the cooperative binding of oxygen, for the effects of pH and 2,3-diphosphoglycerate (DPG) on affinity, and for the defects causing various anemias (Stryer, 1988). Despite the increasing wealth of sequence data, the laborious and time-consuming process of empirical structure determination hampers the availability of detailed structural information. Instead, sequence analysis tools offer the best hope for quickly eliciting structural and functional information from new sequences.

Traditional methods for analyzing sequences rely on the prior analyses of known sequences and on procedures for matching sequences. These techniques encompass database search (Wilbur & Lipman, 1983), sequence classification (Klein et al., 1984; Klein & DeLisi, 1986), and analysis for motifs (Bairoch & Boeckmann, 1991; Henikoff & Henikoff, 1991).
Techniques for both analysis and matching emphasize the conservation of amino acids during evolution. Specifically, one usually assumes that if 2 sequences are homologous, then the amino acids that one observes at corresponding loca...