Recognition of related proteins by iterative template refinement (ITR)

Yi, Tongxun; Lander, Eric S.

doi:10.1002/pro.5560030818

Cited by 46 publications

(23 citation statements)

References 45 publications

(54 reference statements)

“…The strategy is automated readily, and might be combined with iterative searching methods (Tatusov et al, 1994; Yi & Lander, 1994) to provide an even more powerful automated system. Embedding strategies are general, and therefore can be extended to other situations in which motif-based alignment information is available for a group of sequences.…”

Section: Discussion

mentioning

confidence: 99%

Embedding strategies for effective use of information from multiple sequence alignments

Henikoff

1997

Protein Science

We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.

Keywords: consensus sequence; homology searching; multiple sequence alignment; protein blocks; sequence databanks

Improvements in the efficiency of large-scale DNA sequencing are leading to rapid increases in the number of databank sequences that lack genetic or biochemical documentation. This is clearly the case for databases of cDNA sequence fragments (Boguski et al., 1993), which are thought to represent the majority of all human protein sequences, and for databases from large genome sequencing projects, such as the sequencing of uncharacterized bacterial genomes (Nowak, 1995). Matching these unknown sequences with sequences of known function is a major goal of genome research. Meanwhile, there remains the traditional goal of detecting homologues to help understand the function of a protein of interest to a biologist. Improved methods for detecting homology in database searches aid in achieving both goals.

It is widely assumed that homology detection can be improved by utilizing multiple alignment information. Either a single sequence query is used to search for homologues in a database of multiple sequence alignments (Henikoff & Henikoff, 1991; Attwood & Beck, 1994; Sonnhammer & Kahn, 1994) or patterns (Smith & Smith, 1990; Bairoch, 1992), or an alignment or pattern query is used to search a sequence database (Gribskov et al., 1987; Henikoff et al., 1990; Krogh et al., 1994; Neuwald & Green, 1994; Tatusov et al., 1994; Thompson et al., 1994b (Gribskov et al., 1990; Krogh et al., 1994; Eddy, 1996). In either case, position-specific scoring matrices (PSSMs) can represent all available information in a multiple sequence alignment, and several improvements in constructing PSSMs have been introduced recently (Brown et al., 1993; Tatusov et al., 1994; Bailey & Gribskov, 1996; Henikoff & Henikoff, 1996; Sjolander et al., 1996). However, there are no comprehensive evaluation studies that demonstrate the superiority of any multiple alignment-based querying method over single sequence querying methods such as BLAST, FASTA (Pearson, 1990), and Smith-Waterman (Smith & Waterman...
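The consensus-embedding idea described in the abstract above can be illustrated with a minimal Python sketch. It assumes ungapped aligned blocks and a simple majority-vote consensus per column; the function names (`column_consensus`, `embed_consensus`) and the block layout keyed by 0-based start offset are hypothetical and do not come from the cited paper, which may derive consensus residues differently (for example, with sequence weighting).

```python
from collections import Counter


def column_consensus(block):
    """Return the most frequent residue in each column of an ungapped block.

    Assumption: plain majority vote; ties go to the residue seen first.
    """
    length = len(block[0])
    if any(len(segment) != length for segment in block):
        raise ValueError("all block rows must have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*block))


def embed_consensus(representative, blocks):
    """Overwrite conserved regions of a representative sequence with consensus residues.

    `blocks` maps a 0-based start offset in `representative` to the list of
    aligned segments for that conserved region (hypothetical layout).
    Residues outside the blocks are kept unchanged.
    """
    residues = list(representative)
    for start, block in blocks.items():
        consensus = column_consensus(block)
        if start < 0 or start + len(consensus) > len(residues):
            raise ValueError("block does not fit inside the representative sequence")
        residues[start:start + len(consensus)] = consensus
    return "".join(residues)


if __name__ == "__main__":
    query = embed_consensus("MKTAYIAKQR", {2: ["TAYI", "SAYI", "SAYV"]})
    print(query)
```

The resulting string is still an ordinary single sequence, which is why, as the abstract notes, it can be passed to single sequence query programs such as BLAST and FASTA.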

“…In attempts to overcome this limitation, matching methods have been developed that use the features present in multiple aligned sequences of protein families. Examples of such work are sequence templates (Taylor, 1986; Bashford et al, 1987; Tatusov et al, 1994; Yi & Lander, 1994), profiles (Gribskov et al, 1987; Luthy et al, 1994; Thompson et al, 1994) and hidden Markov models (Krogh et al, 1994; Baldi et al, 1994; Eddy, 1995; Eddy et al, 1995). The problem with these procedures is that (i) multiple divergent sequences are required for significant improvements over what is found from single sequence searches, (ii) the accurate alignment of related sequences with low residue identities involves some expertise, and (iii) the scoring schemes for models based on multiple sequence alignments do not at present give thresholds that define high coverage and low error.…”

mentioning

confidence: 98%

Intermediate sequences increase the detection of homology between sequences

Park, Teichmann, Hubbard et al.

1997

Journal of Molecular Biology

“…More recently, advanced sequence comparison methods have been developed on the basis of shared features of sets of related sequences such as protein families. Examples of such approaches are templates (12,13), profiles (14–16), hidden Markov models [HMMs; (17,18)], and PSI-BLAST [position-specific iterated BLAST (19)]. In addition, threading algorithms are also intended to improve detection of homologous pairs from the sequence space in the twilight zone.…”

mentioning

confidence: 99%

Phylogenetic profiles reveal evolutionary relationships within the “twilight zone” of sequence similarity

Chang, Hong, Ko et al.

2008

Proc. Natl. Acad. Sci. U.S.A.

Inferring evolutionary relationships among highly divergent protein sequences is a daunting task. In particular, when pairwise sequence alignments between protein sequences fall <25% identity, the phylogenetic relationships among sequences cannot be estimated with statistical certainty. Here, we show that phylogenetic profiles generated with the Gestalt Domain Detection Algorithm-Basic Local Alignment Tool (GDDA-BLAST) are capable of deriving, ab initio, phylogenetic relationships for highly divergent proteins in a quantifiable and robust manner. Notably, the results from our computational case study of the highly divergent family of retroelements accord with previous estimates of their evolutionary relationships. Taken together, these data demonstrate that GDDA-BLAST provides an independent and powerful measure of evolutionary relationships that does not rely on potentially subjective sequence alignment. We demonstrate that evolutionary relationships can be measured with phylogenetic profiles, and therefore propose that these measurements can provide key insights into relationships among distantly related and/or rapidly evolving proteins.

ab initio | retroelements | reverse transcriptase | GDDA-BLAST

The “protein problem” has remained unsolved despite decades of research (1, 2). In principle, one expects that the primary amino acid sequence of a protein determines its structure, function, and evolutionary (SF&E) characteristics. Yet, there still is no reliable method for predicting the native state structure of a protein and its function given only its sequence. In addition, inferring the evolutionary relationships among highly divergent protein and/or rapidly evolving sequences is a daunting task. In general, when pairwise sequence alignments between protein sequences fall below ≈25% identity (i.e., the “twilight zone”), the assignment of positional homology is so difficult that it becomes impossible to safely estimate phylogenetic relationships (1, 3, 4). However, a small number of conserved residues (≈8% identity) can coordinate the 3-D fold and/or function of proteins (5-7). Conversely, two proteins that share 88% identity can still retain independent structure and function (8).

The aforementioned studies point out that quantitatively measuring data spaces in the protein world (i.e., the sequence, structure, and functional space that proteins occupy) is a fundamental question facing evolutionary/computational biologists, with further questions arising. Is there any equation that quantitatively connects these protein spaces to protein evolution? Which residues within amino acid sequences best reflect the evolutionary history of a given protein? Do proteins with similar sequence and structure necessarily share a common ancestor? Furthermore, if sequence and structure similarity suggest an evolutionary history, can weak similarities be strengthened by functional connections? All of these questions are essentially connected to the protein data space; however, to date they have not been clearly solved either ...
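The roughly 25% identity threshold mentioned above can be made concrete with a short Python sketch. It is only an illustration: the function names are hypothetical, the input is assumed to be an already computed pairwise alignment with `-` as the gap character, and identity is counted over columns where neither sequence has a gap, which is one of several conventions in use and is not taken from the cited abstract.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both strings are rows of the same pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def in_twilight_zone(aligned_a, aligned_b, threshold=25.0):
    """True when identity falls below the (approximate) 25% threshold."""
    return percent_identity(aligned_a, aligned_b) < threshold


if __name__ == "__main__":
    print(percent_identity("ACDE-FG", "ACDQWFA"))
    print(in_twilight_zone("ACDEFGHK", "AWYTRQNM"))
```

Because the threshold is approximate, the `threshold` parameter is left adjustable rather than fixed.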
