Pai-Hsi Huang scite author profile

Background: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies (the labeled data set) can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags (the unlabeled data). In this study, we present a principled and biologically motivated computational framework that exploits the unlabeled data more effectively by using only the sequence regions that are more likely to be biologically relevant, which improves prediction accuracy. As overrepresented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers.

Protein homology detection with biologically inspired features and interpretable statistical models

Huang

Pavlović

2008

IJDMB

Computational classification of proteins using methods such as string kernels and Fisher-SVM has demonstrated great success. However, the resulting models do not offer an immediate interpretation of the underlying biological mechanisms. In particular, some recent studies have postulated that a small subset of positions and residues in protein sequences may be sufficient to discriminate among different protein classes. In this work, we propose a hybrid setting for the classification task. A generative model is trained as a feature extractor, followed by a sparse classifier in the extracted feature space that determines the membership of the sequence while discovering the features relevant for classification. The set of sparse, biologically motivated features and the discriminative method offer the desired biological interpretability. We apply the proposed method to a widely used dataset and show that the performance of our models is comparable to that of state-of-the-art methods. The resulting models use fewer than 10% of the original features. At the same time, the sets of critical features discovered by the model appear to be consistent with confirmed biological findings.
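
The second stage described in this abstract, a sparse classifier that selects features while it classifies, can be illustrated with a minimal sketch. It assumes L1-regularized logistic regression fitted by proximal gradient descent, which is one common way to obtain sparsity and is not necessarily the classifier used in the paper. The generative feature extractor is not modeled: the toy feature matrix, the function names, and all parameter values are hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Shrink a value toward zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_logistic(features, labels, l1_penalty=0.1, step=0.5, n_iter=500):
    """Fit L1-regularized logistic regression by proximal gradient descent."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    for _ in range(n_iter):
        gradient = [0.0] * len(weights)
        for row, label in zip(features, labels):
            score = sum(w * x for w, x in zip(weights, row))
            prob = 1.0 / (1.0 + math.exp(-score))
            for j, x in enumerate(row):
                gradient[j] += (prob - label) * x / n_samples
        # Gradient step on the logistic loss, then shrinkage for the L1 term.
        weights = [soft_threshold(w - step * g, step * l1_penalty)
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical stand-in for extracted features: column 0 separates the
# classes, columns 1 and 2 are balanced within each class (uninformative).
features = [
    [1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0],
    [-1.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, -1.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_sparse_logistic(features, labels)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", weights)
print("selected feature indices:", selected)
```

In this toy setting the penalty drives the weights of the uninformative columns to exactly zero, so the nonzero weights double as a list of selected features, which is the kind of interpretability the abstract refers to.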

Protein homology detection with sparse models

Pavlović

Huang

2008

On the Role of Local Matching for Efficient Semi-supervised Protein Sequence Classification

Kuksa

Huang

Pavlović

2008

Recent studies in protein sequence analysis have leveraged the power of unlabeled data. For example, the profile and mismatch neighborhood kernels have shown significant improvements over classifiers estimated under the fully supervised setting. In this study, we present a principled and biologically motivated framework that exploits the unlabeled data more effectively by using only the regions that are more likely to be biologically relevant, which improves prediction accuracy. As overrepresented sequences in large uncurated databases may bias kernel estimations that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers. Combined with a computationally efficient sparse family of string kernels, our proposed framework achieves state-of-the-art accuracy in semi-supervised protein remote homology detection on three large unlabeled databases.
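
The bias from overrepresented sequences mentioned in this abstract can be made concrete with a rough sketch. It assumes a plain k-mer spectrum representation and a neighborhood average over similar unlabeled sequences, and it collapses exact duplicates as a stand-in for bias removal. The abstract does not describe the paper's actual procedure, so the similarity threshold, the function names, and the toy sequences below are all hypothetical.

```python
from collections import Counter
import math


def spectrum(sequence, k=2):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kernel(a, b):
    """Inner product of two sparse k-mer count vectors."""
    return sum(count * b[kmer] for kmer, count in a.items())


def normalized_kernel(a, b):
    """Cosine-normalized spectrum kernel, in the range [0, 1]."""
    denom = math.sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


def neighborhood_features(sequence, unlabeled, k=2, threshold=0.5, deduplicate=True):
    """Average the spectra of a sequence and its similar unlabeled neighbors."""
    # Assumption: identical copies are collapsed so each counts once.
    pool = set(unlabeled) if deduplicate else list(unlabeled)
    query = spectrum(sequence, k)
    members = [query]
    for candidate in pool:
        candidate_spectrum = spectrum(candidate, k)
        if normalized_kernel(query, candidate_spectrum) >= threshold:
            members.append(candidate_spectrum)
    averaged = Counter()
    for member in members:
        for kmer, count in member.items():
            averaged[kmer] += count / len(members)
    return averaged


# Toy unlabeled set in which one sequence appears five times.
unlabeled = ["ACDF"] * 5 + ["ACDK"]
debiased = neighborhood_features("ACDE", unlabeled, deduplicate=True)
biased = neighborhood_features("ACDE", unlabeled, deduplicate=False)
print("DF weight with duplicates collapsed:", debiased["DF"])
print("DF weight with duplicates kept:", biased["DF"])
```

With duplicates kept, the repeated sequence dominates the averaged representation; collapsing them gives each distinct neighbor equal weight.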


Pai-Hsi Huang

Fast protein homology and fold detection with sparse spatial sample kernels

Efficient use of unlabeled data for protein sequence classification: a comparative study
