Selection of representative protein data sets

Hobohm, Uwe; Scharf, Michael E.; Schneider, Reinhard; Sander, Chris

doi:10.1002/pro.5560010313

Cited by 786 publications

(449 citation statements)

References 14 publications

(21 reference statements)


“…Redundancy reduction in prediction is common practice in areas such as bioinformatics but has not been applied systematically to data sets in chemoinformatics. To study and avoid the biases introduced by redundant data, we use the algorithm in Hobohm et al. [35] to derive redundancy-reduced data sets, by iteratively thinning clusters of molecules with high similarity, until no molecules in the training sets have a similarity greater than some preset threshold (see Results, section 4.7).…”

Section: Training and Optimization For A Given Kernel We Use The E-

mentioning

confidence: 99%

One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, and Biological Properties

Azencott

Ksikes

Swamidass

et al. 2007

J. Chem. Inf. Model.


Many chemoinformatics applications, including high-throughput virtual screening, benefit from being able to rapidly predict the physical, chemical, and biological properties of small molecules to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations including 1D SMILES strings, 2D graphs of bonds, and 3D coordinates to derive efficient machine learning kernels to address regression problems. We further expand the library of available spectral kernels for small molecules developed for classification problems to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested using cross-validation and redundancy-reduction methods on regression problems using several available data sets to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity with state-of-the-art results. When sufficient training data are available, 2D spectral kernels in general tend to yield the best and most robust results, better than state-of-the-art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available over the Web through: http://cdb.ics.uci.edu.


“…The new representative list of PDB chain identifiers was produced using "algorithm 2" of Hobohm et al (1992). This algorithm removes redundant protein chains 1 by 1, following a strategy called "greedy" by computer scientists: the chain with the largest number of neighbors is removed, until no neighbors are left.…”

Section: Selection Procedures

mentioning

confidence: 99%
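The greedy strategy described in the statement above (repeatedly remove the chain with the most neighbors until no neighbors remain) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `neighbor_graph` and `greedy_thin`, the user-supplied `similarity` callable, and the tie-breaking rule (first item in input order) are all assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set


def neighbor_graph(
    ids: Sequence[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> Dict[Hashable, Set[Hashable]]:
    """Link every pair of items whose similarity exceeds the threshold."""
    graph: Dict[Hashable, Set[Hashable]] = {i: set() for i in ids}
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if similarity(a, b) > threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_thin(neighbors: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Remove the item with the most neighbors until no neighbor pairs remain."""
    # Work on a copy so the caller's graph is left untouched.
    graph = {node: set(adj) for node, adj in neighbors.items()}
    while True:
        # Item with the most remaining neighbors (ties: first in dict order).
        worst = max(graph, key=lambda n: len(graph[n]), default=None)
        if worst is None or not graph[worst]:
            break  # nothing left, or no item has any neighbor
        for other in graph.pop(worst):
            graph[other].discard(worst)
    return list(graph)


if __name__ == "__main__":
    # Hypothetical pairwise similarities, for illustration only.
    toy = {frozenset("AB"): 0.9, frozenset("AC"): 0.8, frozenset("BC"): 0.2}
    sim = lambda a, b: toy.get(frozenset((a, b)), 0.1)
    kept = greedy_thin(neighbor_graph(["A", "B", "C", "D"], sim, threshold=0.25))
    print(kept)
```

In the toy data, A is similar to both B and C, so removing A alone leaves a set in which no pair exceeds the threshold. Removing the most-connected item first tends to keep more items than removing items in arbitrary order, which matches the stated aim of maximizing coverage.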

“…Although the wealth of data bears witness to the progress achieved by protein crystallographers and NMR spectroscopists, an overview of the spectrum of known protein structures and certain statistical analyses of protein structures require nonredundant data. To meet this need, we have developed algorithms to select from PDB (or from sequence databases) representative subsets that aim to minimize redundancy and maximize coverage (Hobohm et al, 1992). The result, in the form of a list of PDB identifiers, was published about 1½ years ago and has been very useful, e.g., in developing better algorithms for secondary structure prediction (Rost & Sander, 1993).…”

mentioning

confidence: 99%

“…The current "25%-list" (based on PDB release, December 1993), i.e., a list of protein chains with less than 25% sequence identity, was constrained to be downward compatible with the "30% list" published in 1992 (Hobohm et al, 1992). Some adjustments were a result of the more stringent threshold and new quality criteria, e.g., replacement of a data set by one with identical sequence but higher resolution.…”

Reprint requests to: Uwe Hobohm, European Molecular Biology Laboratory, 69012 Heidelberg, Germany; e-mail: hobohm@embl-heidelberg.de.

mentioning

confidence: 99%


Enlarged representative set of protein structures

1994

Self Cite


To reduce redundancy in the Protein Data Bank of 3D protein structures, which is caused by many homologous proteins in the data bank, we have selected a representative set of structures. The selection algorithm was designed to (1) select as many nonhomologous structures as possible, and (2) select structures of good quality. The representative set may reduce time and effort in statistical analyses.

Keywords: NMR; PDB; representative protein data set; X-ray crystallography.

Representative selection of proteins of known 3-dimensional structure

There is considerable redundancy in the Protein Data Bank (PDB) (Bernstein et al., 1977) of 3-dimensional structures. For example, there currently are atomic coordinates for about 77 globins, 61 immunoglobulins, and 9 structures of phage T4 lysozyme, including many engineered mutants. Although the wealth of data bears witness to the progress achieved by protein crystallographers and NMR spectroscopists, an overview of the spectrum of known protein structures and certain statistical analyses of protein structures require nonredundant data. To meet this need, we have developed algorithms to select from PDB (or from sequence databases) representative subsets that aim to minimize redundancy and maximize coverage (Hobohm et al., 1992). The result, in the form of a list of PDB identifiers, was published about 1½ years ago and has been very useful, e.g., in developing better algorithms for secondary structure prediction (Rost & Sander, 1993). Since then, there has been rapid growth of the number of known protein structures and, in addition, a sudden surge of preliminary data sets released by the PDB: from about 600 in early 1992, there are now about 2,000 PDB coordinate data sets, a 3-fold increase in 18 months. It was therefore time to update the representative list and, based on experience gained since the original publication, to refine the criteria for selection.

The result is an increase from 155 to 301 (95%) in the number of sequence-unique proteins. The current "25%-list" (based on PDB release, December 1993), i.e., a list of protein chains with less than 25% sequence identity, was constrained to be downward compatible with the "30% list" published in 1992 (Hobohm et al., 1992). Some adjustments were a result of the more stringent threshold and new quality criteria, e.g., replacement of a data set by one with identical sequence but higher resolution. Downward compatibility will reduce the number of changes in user applications upon new releases of the list and will be maintained in the future.

The list is provided in Table 1 and on the Diskette Appendix and is also available on the EMBL file transfer (ftp) and e-mail servers.

Reprint requests to: Uwe Hobohm, European Molecular Biology Laboratory, 69012 Heidelberg, Germany; e-mail: hobohm@embl-heidelberg.de.

Quality control of selected data sets

Given a choice between 2 different data sets, from which only 1 can be selected into the list, one would like to use the one of higher quality. For instance, one wishes to avoi...


“…Abola et al, 1987), comprising over 100 nonhomologous structures (Hobohm et al, 1992), it is well known that a number of structural themes recur. One example of such a recurring structure is the Rossmann fold (Rossmann et al, 1974), found in such proteins as lactate dehydrogenase, glyceraldehyde-3-phosphate dehydrogenase, and alcohol dehydrogenase (Branden & Tooze, 1991).…”

mentioning

confidence: 99%

Structural analysis based on state‐space modeling

1993


A new method has been developed to compute the probability that each amino acid in a protein sequence is in a particular secondary structural element. Each of these probabilities is computed using the entire sequence and a set of predefined structural class models. This set of structural classes is patterned after Jane Richardson's taxonomy for the domains of globular proteins. For each structural class considered, a mathematical model is constructed to represent constraints on the pattern of secondary structural elements characteristic of that class. These are stochastic models having discrete state spaces (referred to as hidden Markov models by researchers in signal processing and automatic speech recognition). Each model is a mathematical generator of amino acid sequences; the sequence under consideration is modeled as having been generated by one model in the set of candidates. The probability that each model generated the given sequence is computed using a filtering algorithm. The protein is then classified as belonging to the structural class having the most probable model. The secondary structure of the sequence is then analyzed using a "smoothing" algorithm that is optimal for that structural class model. For each residue position in the sequence, the smoother computes the probability that the residue is contained within each of the defined secondary structural elements of the model. This method has two important advantages: (1) the probability of each residue being in each of the modeled secondary structural elements is computed using the totality of the amino acid sequence, and (2) these probabilities are consistent with prior knowledge of realizable domain folds as encoded in each model. 
As an example of the method's utility, we present its application to flavodoxin, a prototypical α/β protein having a central β-sheet, and to thioredoxin, which belongs to a similar structural class but shares no significant sequence similarity.

Keywords: flavodoxin; hidden Markov model; secondary structure; state space modeling; tertiary structure classification; thioredoxin

The number of known protein sequences greatly exceeds the number of directly determined structures. This is due in part to the relative ease of protein sequence determination by DNA sequencing as compared to the difficulty and expense of structure determination by X-ray crystallography and NMR spectroscopy. Although the number of known structures continues to grow, the number of protein sequences is expected to greatly exceed the number of known structures for the foreseeable future. Because knowledge of the structure of a protein is essential to understanding its function, the elucidation of any aspect of a protein's structure from the sequence information alone is potentially useful.

Although there are over 600 structures in the Brookhaven protein structural database (Bernstein et al., 1977;
