Background: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags (the unlabeled data). In this study, we present a principled and biologically motivated computational framework that exploits the unlabeled data more effectively by using only the sequence regions that are more likely to be biologically relevant, which improves prediction accuracy. As overrepresented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers.
Computational classification of proteins using methods such as string kernels and Fisher-SVM has demonstrated great success. However, the resulting models do not offer an immediate interpretation of the underlying biological mechanisms. In particular, some recent studies have postulated that a small subset of positions and residues in protein sequences may be sufficient to discriminate among different protein classes. In this work, we propose a hybrid setting for the classification task. A generative model is trained as a feature extractor, followed by a sparse classifier in the extracted feature space that determines the membership of the sequence while discovering the features relevant for classification. The set of sparse, biologically motivated features and the discriminative method together offer the desired biological interpretability. We apply the proposed method to a widely used dataset and show that the performance of our models is comparable to that of state-of-the-art methods. The resulting models use fewer than 10% of the original features. At the same time, the sets of critical features discovered by the model appear to be consistent with confirmed biological findings.
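The second stage of the hybrid setting, a sparse classifier over extracted features, can be illustrated with a minimal sketch. The abstract does not name the generative model or the sparse classifier, so everything here is an assumption made for illustration: the classifier is L1-regularized logistic regression fitted by proximal gradient descent, the function names `soft_threshold` and `fit_sparse_logistic` are hypothetical, and the toy feature matrix stands in for whatever features a generative model would extract.

```python
from math import exp


def soft_threshold(value, amount):
    """Shrink value toward zero by amount; small values become exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_sparse_logistic(features, labels, l1=0.2, lr=0.1, steps=2000):
    """L1-regularized logistic regression via proximal gradient (no bias term).

    Assumption: a stand-in for the unspecified sparse classifier.
    """
    n, d = len(features), len(features[0])
    w = [0.0] * d
    for _ in range(steps):
        # Predicted probability of the positive class for every sequence.
        probs = [
            1.0 / (1.0 + exp(-sum(wj * xj for wj, xj in zip(w, row))))
            for row in features
        ]
        # Gradient of the mean logistic loss with respect to each weight.
        grad = [
            sum((p - y) * row[j] for p, y, row in zip(probs, labels, features)) / n
            for j in range(d)
        ]
        # Gradient step, then the L1 proximal step that zeroes weak features.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad)]
    return w


# Hypothetical extracted features: column 0 separates the classes,
# column 1 is only weakly related to the label, column 2 is uninformative.
toy_features = [
    [1, 1, 0], [1, 1, 0], [1, 1, 0], [1, -1, 0],
    [-1, -1, 0], [-1, 1, 0], [-1, -1, 0], [-1, 1, 0],
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_sparse_logistic(toy_features, toy_labels)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(selected)
```

On this toy data only feature 0 keeps a nonzero weight, which mirrors the idea that a small subset of features can carry the discrimination and can then be inspected directly.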
Recent studies in protein sequence analysis have leveraged the power of unlabeled data. For example, the profile and mismatch neighborhood kernels have shown significant improvements over classifiers estimated under the fully supervised setting. In this study, we present a principled and biologically motivated framework that exploits the unlabeled data more effectively by using only the regions that are more likely to be biologically relevant, which improves prediction accuracy. As overrepresented sequences in large uncurated databases may bias kernel estimations that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers. Combined with a computationally efficient sparse family of string kernels, our proposed framework achieves state-of-the-art accuracy in semi-supervised protein remote homology detection on three large unlabeled databases.
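To make the idea of a neighborhood kernel and the overrepresentation bias concrete, here is a minimal sketch. It is not the authors' method: it uses a plain k-mer spectrum kernel rather than the mismatch, profile, or sparse kernels named above, it finds neighbors with a hypothetical cosine-similarity threshold rather than a sequence search tool, and it removes bias only by collapsing exact duplicates in the unlabeled pool. All function names and parameters are assumptions.

```python
from collections import Counter
from math import sqrt


def spectrum(seq, k=3):
    """Count every contiguous k-mer in seq (the k-spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dot(a, b):
    """Inner product of two sparse count vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(key, 0) for key, v in a.items())


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is empty."""
    na, nb = sqrt(dot(a, a)), sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def neighborhood_map(seq, unlabeled, k=3, threshold=0.5, dedupe=True):
    """Average the spectrum maps of seq and its neighbors in the unlabeled pool.

    With dedupe=True, identical unlabeled sequences count once, so a heavily
    duplicated sequence cannot dominate the average.
    """
    own = spectrum(seq, k)
    pool = set(unlabeled) if dedupe else list(unlabeled)
    members = [own]
    for candidate in pool:
        candidate_map = spectrum(candidate, k)
        if cosine(own, candidate_map) >= threshold:
            members.append(candidate_map)
    averaged = Counter()
    for member in members:
        for key, value in member.items():
            averaged[key] += value / len(members)
    return averaged


def neighborhood_kernel(x, y, unlabeled, **options):
    """Kernel value between x and y after smoothing both over their neighbors."""
    return dot(
        neighborhood_map(x, unlabeled, **options),
        neighborhood_map(y, unlabeled, **options),
    )
```

With an empty unlabeled pool the function reduces to the ordinary spectrum kernel, so the unlabeled sequences act purely as a smoothing term. Comparing `dedupe=True` against `dedupe=False` on a pool that repeats one sequence many times shows how redundancy shifts the averaged feature map.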