A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models have been built and validated using qualitative descriptors that include amino acid type and quantitative descriptors that include 20- and six-term sequence entropy. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating whether a residue is buried or solvent accessible, with the binary classifications obtained from the RSA values. The fitted models yield binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods, using the standard RSA thresholds of 20% and 25% to classify a residue as solvent accessible. When an additional non-homology descriptor describing Lobanov-Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved, with 25% threshold accuracies of 76.12% and 74.79% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.
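The pipeline summarized above (sequence entropy descriptors, a dichotomous buried/accessible response, and a logistic fit) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' software: the six-class grouping of amino acids, the natural-log entropy, the `>=` threshold convention, the gradient-descent fitting routine, and every function name are hypothetical choices made for the example, and the toy numbers are invented.

```python
import math
from collections import Counter

# Assumed six-class grouping of the 20 amino acids (hypothetical; the
# source does not list the classes behind its six-term entropy).
SIX_CLASSES = {
    "aliphatic": "AVLIMC", "aromatic": "FWYH", "polar": "STNQ",
    "positive": "KR", "negative": "DE", "special": "GP",
}
RESIDUE_TO_CLASS = {aa: name for name, members in SIX_CLASSES.items()
                    for aa in members}


def shannon_entropy(symbols):
    """Shannon entropy (natural log, an assumption) of symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def column_entropies(aligned_residues):
    """Return (20-term, six-term) entropy for one alignment column."""
    # Gaps and non-standard symbols are skipped.
    residues = [aa for aa in aligned_residues if aa in RESIDUE_TO_CLASS]
    classes = [RESIDUE_TO_CLASS[aa] for aa in residues]
    return shannon_entropy(residues), shannon_entropy(classes)


def is_accessible(rsa_percent, threshold=25.0):
    """Dichotomous response: 1 = accessible, 0 = buried (>= is assumed)."""
    return 1 if rsa_percent >= threshold else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit by plain batch gradient descent; returns weights, bias first."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            features = [1.0] + list(x)
            error = sigmoid(sum(w * f for w, f in zip(weights, features))) - y
            for j, f in enumerate(features):
                grad[j] += error * f
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def predict(weights, x):
    """Binary prediction: 1 if the fitted probability is at least 0.5."""
    features = [1.0] + list(x)
    z = sum(w * f for w, f in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy data (invented): one entropy descriptor per residue, RSA in percent.
toy_entropy = [(0.1,), (0.3,), (0.2,), (1.5,), (1.8,), (2.0,)]
toy_rsa = [5.0, 10.0, 12.0, 40.0, 60.0, 55.0]
toy_labels = [is_accessible(r) for r in toy_rsa]
toy_weights = fit_logistic(toy_entropy, toy_labels)
```

In practice a library routine would replace the hand-written fit, and the feature vector would also carry the query residue's amino acid type as indicator variables; both are omitted here to keep the sketch self-contained.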
APPLICATION OF QUERY-BASED QUALITATIVE DESCRIPTORS IN CONJUNCTION WITH PROTEIN SEQUENCE HOMOLOGY FOR PREDICTION OF RESIDUE SOLVENT ACCESSIBILITY

by Reecha Nepal

Characterization of relative solvent accessibility (RSA) plays a major role in classifying a given protein residue as being on the surface or buried. This information is useful for studying protein structure and protein-protein interactions, and it is usually the first approach applied in the prediction of three-dimensional (3D) protein structures. Various complicated and time-consuming methods, such as machine learning, have been applied in solvent-accessibility predictions. In this thesis, we present a simple application of linear regression methods using various sequence homology values for each residue as well as query residue qualitative predictors corresponding to each of the 20 amino acids. Initially, a fit was generated by applying linear regression to training sets with a variety of sequence homology parameters, including various sequence entropies and residue qualitative predictors. The coefficients obtained from the training sets were then applied to the test set to compute predicted RSA values. The qualitative predictors describe the actual query residue type (e.g., Gly) as opposed to the measures of sequence homology for the aligned subject residues. The prediction accuracies were calculated by comparing the predicted RSA values with NACCESS RSA values (derived from X-ray crystallography). The utilization of qualitative predictors yielded significant prediction accuracy.

ACKNOWLEDGEMENT

First of all, I would like to express my gratitude towards my research advisor, Dr. Brooke Lustig, for guiding me through all of this work. I would like to thank him for his patience, the amount of time he spent with me, and all the opportunities he provided me.
I will forever be thankful for his encouragement and his positive attitude when things were not working as expected. I would also like to thank my M.S. thesis committee members, Dr. Daryl Eggers and Dr. Marc d'Alarcao, both for agreeing to serve on my committee and for taking the time to read my thesis and provide valuable feedback. I would also like to thank my parents and my brother for their unshakable belief in me. Finally, I would like to extend my heartfelt gratitude towards my husband, Sailesh Agrawal, without whose help, support, and encouragement this work would not have been possible.

CONTENTS

List of Abbreviations