1992
DOI: 10.1002/pro.5560010313

Selection of representative protein data sets

Abstract: The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first…

Cited by 786 publications (449 citation statements)
References 14 publications (21 reference statements)
“…Redundancy reduction in prediction is common practice in areas such as bioinformatics but has not been applied systematically to data sets in chemoinformatics. To study and avoid the biases introduced by redundant data, we use the algorithm in Hobohm et al 35 to derive redundancy-reduced data sets, by iteratively thinning clusters of molecules with high similarity, until no molecules in the training sets have a similarity greater than some preset threshold (see Results, section 4.7).…”
Section: Training and Optimization For A Given Kernel We Use The E-
mentioning
confidence: 99%
“…The new representative list of PDB chain identifiers was produced using "algorithm 2" of Hobohm et al (1992). This algorithm removes redundant protein chains 1 by 1, following a strategy called "greedy" by computer scientists: the chain with the largest number of neighbors is removed, until no neighbors are left.…”
Section: Selection Procedures
mentioning
confidence: 99%
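The greedy removal described in the statement above can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes the neighbor relation (which pairs count as redundant) has already been computed and is passed in as a dictionary, and the function name `greedy_representatives` and the alphabetical tie-breaking rule are hypothetical choices made here.

```python
from typing import Dict, Hashable, Set


def greedy_representatives(
    neighbors: Dict[Hashable, Set[Hashable]],
) -> Set[Hashable]:
    """Drop the item with the most remaining neighbors until none are left.

    `neighbors` maps each item (e.g. a chain identifier) to the set of items
    considered redundant with it. How redundancy is measured is assumed to be
    decided elsewhere.
    """
    # Copy the input, drop self-links, and make the relation symmetric.
    graph = {node: set(adj) - {node} for node, adj in neighbors.items()}
    for node, adj in list(graph.items()):
        for other in adj:
            graph.setdefault(other, set()).add(node)

    while graph:
        # Item with the largest neighbor count; ties are broken by name
        # (an arbitrary but deterministic choice made for this sketch).
        worst = max(graph, key=lambda n: (len(graph[n]), str(n)))
        if not graph[worst]:
            break  # no neighbor pairs remain
        for other in graph.pop(worst):
            graph[other].discard(worst)

    return set(graph)


# A hub with three mutually unrelated neighbors: removing the hub alone
# resolves every redundant pair and keeps three representatives.
print(sorted(greedy_representatives({"H": {"X", "Y", "Z"}})))
```

Removing the most-connected item first tends to keep more representatives than removing items in arbitrary order, as the hub example suggests: dropping `H` resolves three pairs at once, whereas keeping `H` would force the removal of `X`, `Y`, and `Z`.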
“…Although the wealth of data bears witness to the progress achieved by protein crystallographers and NMR spectroscopists, an overview of the spectrum of known protein structures and certain statistical analyses of protein structures require nonredundant data. To meet this need, we have developed algorithms to select from PDB (or from sequence databases) representative subsets that aim to minimize redundancy and maximize coverage (Hobohm et al, 1992). The result, in the form of a list of PDB identifiers, was published about 1½ years ago and has been very useful, e.g., in developing better algorithms for secondary structure prediction (Rost & Sander, 1993).…”
mentioning
confidence: 99%
“…Abola et al, 1987), comprising over 100 nonhomologous structures (Hobohm et al, 1992), it is well known that a number of structural themes recur. One example of such a recurring structure is the Rossman fold (Rossman et al, 1974), found in such proteins as lactate dehydrogenase, glyceraldehyde-3-phosphate dehydrogenase, and alcohol dehydrogenase (Branden & Tooze, 1991).…”
mentioning
confidence: 99%