One of the basic issues that arises in functional genomics is the ability to predict the subcellular location of proteins that are deduced from gene and genome sequencing. In particular, one would like to be able to readily specify those proteins that are soluble and those that are inserted in a membrane. Traditional methods of distinguishing between these two locations have relied on extensive, time-consuming biochemical studies. The alternative approach has been to make inferences based on a visual search of the amino acid sequences of presumed gene products for stretches of hydrophobic amino acids. This numerical, sequence-based approach is usually seen as a first approximation pending more reliable biochemical data. The recent availability of large and complete sequence data sets for several organisms allows us to determine just how accurate such a numerical approach could be, and to attempt to minimize and quantify the error involved. We have optimized a statistical approach to protein location determination. Using our approach, we have determined that surprisingly few proteins are misallocated using the numerical method. We also examine the biological implications of the success of this technique.

Keywords: computer modeling; discriminant analysis; hydropathy; membrane proteins; statistical methods

Experimental methods of determination of subcellular protein location are accurate but time-consuming. Hydropathy analysis (Kyte & Doolittle, 1982) has often been used to deduce subcellular localization of proteins in the absence of experimental data. However, although visual inspection of hydropathy plots can be useful in predicting the topology of known integral membrane proteins, it is ineffective as an accurate predictor of the location of a protein.

To discriminate between integral and peripheral membrane proteins, Klein et al. (1985) generated a single number, maxH, the average hydropathy of the most hydrophobic protein segment of a given length for a given protein using a given hydropathy scale. In the interest of clarity, we will refer to this number as the "maxH value," while using the term "maxH segment" to refer to the hydrophobic peptide segment to which it belongs. This was then applied to a set of known integral and peripheral membrane proteins in a training set. A discriminator function was generated that assigned a probability of being an integral membrane protein to a given value of maxH. This function was then used to analyze a similar set of known proteins, the tester set. It was determined that the Kyte-Doolittle hydropathy scale and a window length of 17 residues gave the best resolution of membrane and soluble proteins in the tester set. We have found that the method used by Klein et al. (1985) is still generally useful, but that the actual functions provided in their paper are not, having been derived at a time when very few proteins were both sequenced and characterized.
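The maxH value and maxH segment defined above can be illustrated with a short sliding-window calculation. The following is a minimal sketch, not the implementation used by Klein et al. (1985) or in this work: the function name `max_h`, its return format, and the choice to reject sequences shorter than the window are assumptions made for illustration. The scale values are those of the Kyte-Doolittle hydropathy index, and the default window of 17 residues follows the text.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
# (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_h(sequence, window=17, scale=KD):
    """Return (maxH value, start index, maxH segment) for a protein sequence.

    The maxH value is the highest mean hydropathy over all contiguous
    segments of `window` residues. Hypothetical helper for illustration.
    """
    seq = sequence.upper()
    if len(seq) < window:
        # Assumption: short sequences are rejected rather than averaged whole.
        raise ValueError("sequence is shorter than the window")

    # Raises KeyError for residues outside the 20 standard amino acids.
    scores = [scale[aa] for aa in seq]

    # Running sum: slide the window one residue at a time.
    window_sum = sum(scores[:window])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(seq) - window + 1):
        window_sum += scores[start + window - 1] - scores[start - 1]
        if window_sum > best_sum:  # ties keep the earliest segment
            best_sum, best_start = window_sum, start

    segment = seq[best_start:best_start + window]
    return best_sum / window, best_start, segment
```

Calling `max_h(protein_sequence)` on a one-letter amino acid string returns the maxH value together with the position and residues of the maxH segment. A discriminator function of the kind described above would then take only the first of these values as its input.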
Results

We discovered the need for a new discriminator when we attempted to apply the functions described by Klei...