Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN^2) operations, linear in the number of sequences.
Comparative analysis of primary sequence information is a major tool in the elucidation of the molecular mechanisms of replication and evolution of organisms and the structure and function of proteins. For the simple case of pairwise sequence comparison, good algorithms exist (see refs. 1 and 2 for recent reviews) that can align two sequences of length N in roughly O(N^2) steps. Most of these algorithms are based on dynamic programming (3), with location-independent substitution and gap penalties. Unfortunately, when dynamic programming is applied to a family of K sequences its behavior scales like O(N^K), exponentially in the number of sequences (4).
A number of algorithms have been devised to try to tackle the multiple alignment problem (see refs. 5-7 for some of the most recent ones). Most protein sequence relationships exhibiting >50% identical residues can be aligned by several of these algorithms.
Many of the most interesting protein families, however, exhibit conservation far below 50% identity. To date, alignment methods have not been developed that can correctly identify all the motifs that define each protein family (2).
Here, we apply a different approach, based on hidden Markov models (HMMs), to the problem of modeling and aligning a family by using primary structure information only. Initial results were presented (8). Markov models and the related expectation-maximization (EM) (9) algorithm in statistics have already been applied to biocomputational problems (10-13). Krogh et al. (14) were the first to demonstrate the power of a similar method on the globin family. Rather than starting from pairwise alignments, the approach seeks to take advantage of the massive amount of information typically present in a family with a flexible use of position-dependent parameters. A new algorithm is introduced for the iterative adjustments of the parameters of the models. The algorithm is used here to model three protein families: globins, immunoglobulins, and kinases.
HMMs and Learning
A first-order discrete HMM (15) is completely defined by a set of states S, an alphabet of m symbols, a probability transition matrix T = (t_ij), and a probability emission matrix E = (e_ia). When the system is in state i, it has a probability t_ij of moving to state j and a probability e_ia of emitting symbol a. Only the output s...
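As a rough illustration of the definition above, a first-order discrete HMM can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the architecture or the learning algorithm of the paper: the two states, the two-letter alphabet, all probability values, the initial distribution START, and the function names sample, likelihood, and total_probability are hypothetical choices made for the example. The matrices T and E follow the notation of the text, with T[i][j] = t_ij and E[i][a] = e_ia.

```python
import random
from itertools import product

# Hypothetical toy model: two states and a two-letter alphabet (not from the paper).
STATES = ["s0", "s1"]
ALPHABET = ["A", "B"]

# T[i][j] = t_ij: probability of moving from state i to state j.
T = [[0.9, 0.1],
     [0.2, 0.8]]

# E[i][a] = e_ia: probability that state i emits symbol a.
E = [[0.7, 0.3],
     [0.1, 0.9]]

# Assumed initial state distribution (the excerpt does not specify one).
START = [0.5, 0.5]


def sample(length, rng):
    """Generate `length` symbols: emit from the current state, then transition."""
    state = rng.choices(range(len(STATES)), weights=START)[0]
    symbols = []
    for _ in range(length):
        a = rng.choices(range(len(ALPHABET)), weights=E[state])[0]
        symbols.append(ALPHABET[a])
        state = rng.choices(range(len(STATES)), weights=T[state])[0]
    return "".join(symbols)


def likelihood(sequence):
    """Forward algorithm: P(sequence | model) for a non-empty sequence,
    summed over all hidden state paths."""
    obs = [ALPHABET.index(c) for c in sequence]
    # alpha[i] = P(symbols seen so far, current state = i)
    alpha = [START[i] * E[i][obs[0]] for i in range(len(STATES))]
    for a in obs[1:]:
        alpha = [
            sum(alpha[i] * T[i][j] for i in range(len(STATES))) * E[j][a]
            for j in range(len(STATES))
        ]
    return sum(alpha)


def total_probability(length):
    """Sanity check: likelihoods of all sequences of one length should sum to 1."""
    return sum(likelihood("".join(p)) for p in product(ALPHABET, repeat=length))
```

Calling sample(10, random.Random(0)) draws one sequence of ten symbols from the toy model, and likelihood scores any sequence over the same alphabet. The forward recursion costs on the order of the sequence length times the square of the number of states, which is why scoring each of K sequences separately grows only linearly in K.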
Experimental allergic encephalomyelitis (EAE) is induced by T cell-mediated immunity to central nervous system antigens. In H-2u mice, EAE is mediated primarily by T cells specific for residues 1-11 of myelin basic protein (MBP). We demonstrate that differential tolerance to MBP1-11 versus epitopes in MBP121-150 is induced by expression of endogenous MBP, reflecting extreme differences in stability of peptide/MHC complexes. The diverse MBP121-150-specific TCR repertoire can be divided into three fine specificity groups. Two groups were identified in wild-type mice despite extensive tolerance, but the third group was not detected. Activated MBP121-150-specific T cells induce EAE in wild-type mice. Thus, encephalitogenic T cells that escape tolerance either recognize short-lived peptide/MHC complexes or express TCRs with unique specificities for stable complexes.
Only 10 different V beta gene segments were found when the sequences of 15 variable (V beta) genes of the mouse T-cell receptor were examined. From this analysis we calculate that the total number of expressed V beta gene segments may be 21 or fewer, which makes the expressed germline V beta repertoire much smaller than that of immunoglobulin heavy-chain or light-chain genes. We suggest that beta-chain somatic diversification is concentrated at the V beta-D beta-J beta junctions.
DNA sequence analysis is a multistage process that includes the preparation of DNA, its fragmentation and base analysis, and the interpretation of the resulting sequence information. New technological advances have led to the automation of certain steps in this process and have raised the possibility of large-scale DNA sequencing efforts in the near future [for example, 1 million base pairs (Mb) per year]. New sequencing methodologies, fully automated instrumentation, and improvements in sequencing-related computational resources may render genome-size sequencing projects (100 Mb or larger) feasible during the next 5 to 10 years.