To reduce redundancy in the Protein Data Bank of 3D protein structures, which is caused by the many homologous proteins in the data bank, we have selected a representative set of structures. The selection algorithm was designed to (1) select as many nonhomologous structures as possible and (2) select structures of good quality. The representative set may reduce time and effort in statistical analyses.
Keywords: NMR; PDB; representative protein data set; X-ray crystallography.
Representative selection of proteins of known 3-dimensional structure
Reprint requests to: Uwe Hobohm, European Molecular Biology Laboratory, 69012 Heidelberg, Germany; e-mail: hobohm@embl-heidelberg.de.
There is considerable redundancy in the Protein Data Bank (PDB) (Bernstein et al., 1977) of 3-dimensional structures. For example, there currently are atomic coordinates for about 77 globins, 61 immunoglobulins, and 9 structures of phage T4 lysozyme, including many engineered mutants. Although the wealth of data bears witness to the progress achieved by protein crystallographers and NMR spectroscopists, an overview of the spectrum of known protein structures and certain statistical analyses of protein structures require nonredundant data. To meet this need, we have developed algorithms to select from PDB (or from sequence databases) representative subsets that aim to minimize redundancy and maximize coverage (Hobohm et al., 1992). The result, in the form of a list of PDB identifiers, was published about 1½ years ago and has been very useful, e.g., in developing better algorithms for secondary structure prediction (Rost & Sander, 1993). Since then, the number of known protein structures has grown rapidly and, in addition, the PDB has released a sudden surge of preliminary data sets: from about 600 in early 1992, there are now about 2,000 PDB coordinate data sets, a 3-fold increase in 18 months. It was therefore time to update the representative list and, based on experience gained since the original publication, to refine the criteria for selection. The result is an increase from 155 to 301 (95%) in the number of sequence-unique proteins. The current "25% list" (based on PDB release, December 1993), i.e., a list of protein chains with less than 25% sequence identity, was constrained to be downward compatible with the "30% list" published in 1992 (Hobohm et al., 1992).
Some adjustments were a result of the more stringent threshold and new quality criteria, e.g., replacement of a data set by one with identical sequence but higher resolution. Downward compatibility will reduce the number of changes in user applications upon new releases of the list and will be maintained in the future.
The list is provided in Table 1 and on the Diskette Appendix and is also available on the EMBL file transfer (ftp) and e-mail servers.
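To illustrate the idea of a representative list with a sequence identity cutoff, the following is a minimal Python sketch of one possible greedy selection. It is not the published algorithm of Hobohm et al. (1992); the names `Chain`, `select_representatives`, and `identity`, the toy identifiers, and the use of resolution as the only quality measure are assumptions made for illustration. Chains are visited from best to worst resolution, and a chain is kept only if its identity to every chain already kept is below the threshold.

```python
from typing import Callable, List, NamedTuple


class Chain(NamedTuple):
    """Hypothetical record for one protein chain (illustrative only)."""
    chain_id: str      # made-up identifier, not a real PDB code
    resolution: float  # in angstroms; a lower value means higher resolution


def select_representatives(
    chains: List[Chain],
    identity: Callable[[Chain, Chain], float],
    threshold: float = 25.0,
) -> List[Chain]:
    """Greedy sketch: keep a chain only if its percent sequence identity
    to every already selected chain is below `threshold`."""
    selected: List[Chain] = []
    # Assumption: resolution alone stands in for data set quality.
    for candidate in sorted(chains, key=lambda c: c.resolution):
        if all(identity(candidate, kept) < threshold for kept in selected):
            selected.append(candidate)
    return selected


# Toy data: two homologous globin-like chains and one unrelated chain.
chains = [Chain("glob_a", 2.0), Chain("glob_b", 1.5), Chain("lys_a", 1.7)]

# Made-up pairwise percent identities, keyed by unordered pair of ids.
toy_identity = {
    frozenset(("glob_a", "glob_b")): 80.0,
    frozenset(("glob_a", "lys_a")): 10.0,
    frozenset(("glob_b", "lys_a")): 12.0,
}


def identity(a: Chain, b: Chain) -> float:
    """Look up the toy percent identity for a pair of chains."""
    return toy_identity[frozenset((a.chain_id, b.chain_id))]


representatives = select_representatives(chains, identity)
print([c.chain_id for c in representatives])  # ['glob_b', 'lys_a']
```

In this toy case the better-resolved of the two homologous chains is kept and the other is dropped. A greedy pass of this kind does not guarantee the largest possible nonhomologous set, and it does not model the downward compatibility constraint described above.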
Quality control of selected data sets
Given a choice between 2 different data sets, from which only 1 can be selected into the list, one would like to use the one of higher quality. For instance, one wishes to avoi...