Abstract. This paper presents an interdisciplinary investigation of statistical information retrieval (IR) techniques for protein identification from tandem mass spectra, a challenging problem in proteomic data analysis. We formulate the task as an IR problem by constructing a "query vector" whose elements are system-predicted peptides with confidence scores based on spectrum analysis of the input sample, and by defining the vector space of "documents" with protein profiles, each of which is constructed from the theoretical spectrum of a protein. This formulation establishes a new connection from the protein identification problem to probabilistic language modeling as well as to the vector space models in IR, and enables us to compare fundamental differences between the IR models and common approaches in protein identification. Our experiments on benchmark spectrometry query sets and large protein databases demonstrate that the IR models significantly outperform well-established methods in protein identification, in particular by enhancing precision in high-recall regions, where the conventional approaches are weak.
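The formulation above can be illustrated with a minimal sketch of vector space ranking. It assumes cosine similarity as the scoring function and binary peptide weights in each protein profile; the abstract does not specify either choice. The peptide and protein names, the confidence scores, and the helper functions `cosine` and `rank_proteins` are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
import math


def cosine(query, profile):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * profile.get(term, 0.0) for term, w in query.items())
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    p_norm = math.sqrt(sum(w * w for w in profile.values()))
    if q_norm == 0.0 or p_norm == 0.0:
        return 0.0
    return dot / (q_norm * p_norm)


def rank_proteins(query, profiles):
    """Return (protein, score) pairs sorted by descending similarity."""
    scored = [(name, cosine(query, prof)) for name, prof in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical query: predicted peptides with confidence scores
# (assumed to come from spectrum analysis of the input sample).
query = {"PEPTIDEA": 0.9, "PEPTIDEB": 0.6, "PEPTIDEC": 0.2}

# Hypothetical protein profiles: peptides expected from each protein's
# theoretical spectrum, with binary weights (an assumption).
profiles = {
    "PROT1": {"PEPTIDEA": 1.0, "PEPTIDEB": 1.0},
    "PROT2": {"PEPTIDEC": 1.0, "PEPTIDED": 1.0},
    "PROT3": {"PEPTIDEE": 1.0},
}

ranking = rank_proteins(query, profiles)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

In this toy setup, a protein whose profile shares high-confidence peptides with the query ranks above one that shares only a low-confidence peptide, and a protein with no shared peptides scores zero. A language modeling variant would replace `cosine` with a smoothed query likelihood.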
We present the first interdisciplinary work on transforming a popular problem in proteomics, namely protein identification from tandem mass spectra, into an Information Retrieval (IR) problem. We empirically compare popular IR approaches, such as those available in the Indri and Lemur toolkits, with representative baselines from the proteomics literature on benchmark datasets. Our experiments provide statistically significant evidence that the IR approaches outperform these baselines.