The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS, there are currently two major options for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases, which include large numbers of non-biochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if non-biochemical structures can be eliminated from consideration. In this paper we present BioSM, a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1,388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1,565 scaffolds. Analysis of two additional biological datasets containing 2,330 human metabolites (HMDB) and 2,416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds, respectively. When a dataset of 3,895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) was examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds from PubChem were biochemical. We then expanded the scaffold list to 3,927 biochemical compounds and reevaluated the above datasets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the dataset comparison results were very similar. These results suggest that additional biochemical scaffolds will not further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method for endogenous mammalian biochemical annotation of chemical space, and thus will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu.
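The leave-one-out evaluation described in this abstract can be illustrated with a small sketch. This is not BioSM's implementation: BioSM uses graph matching against scaffolds, whereas the code below substitutes a much simpler Tanimoto similarity over hypothetical feature sets. The function names, the 0.5 threshold, and the toy scaffolds are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet

Features = FrozenSet[str]  # hypothetical stand-in for a molecular graph


def tanimoto(a: Features, b: Features) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score(query: Features, scaffolds: Dict[str, Features]) -> float:
    """Best similarity of the query to any scaffold (a ranking value)."""
    return max((tanimoto(query, s) for s in scaffolds.values()), default=0.0)


def is_biochemical(query: Features, scaffolds: Dict[str, Features],
                   threshold: float = 0.5) -> bool:
    """Yes/no call: the threshold is an assumed value, not BioSM's."""
    return score(query, scaffolds) >= threshold


def leave_one_out_sensitivity(scaffolds: Dict[str, Features],
                              threshold: float = 0.5) -> float:
    """Hold out each scaffold in turn and classify it using the rest."""
    hits = 0
    for name, feats in scaffolds.items():
        rest = {k: v for k, v in scaffolds.items() if k != name}
        if is_biochemical(feats, rest, threshold):
            hits += 1
    return hits / len(scaffolds)


# Toy data: invented feature sets, not real compounds.
toy_scaffolds = {
    "A": frozenset({"carboxyl", "amine", "methyl"}),
    "B": frozenset({"carboxyl", "amine", "hydroxyl"}),
    "C": frozenset({"carboxyl", "amine", "thiol"}),
    "D": frozenset({"nitro", "chloro", "bromo"}),
}

print(leave_one_out_sensitivity(toy_scaffolds))
```

In this toy set, A, B, and C each find a sufficiently similar neighbor when held out, while D does not, so three of four held-out scaffolds are recovered. The same `score` value could serve as the quantitative ranking and the thresholded call as the qualitative yes/no answer.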
Current methods of structure identification in mass spectrometry-based non-targeted metabolomics rely on matching experimentally determined features of an unknown compound to those of candidate compounds contained in biochemical databases. A major limitation of this approach is the relatively small number of compounds currently included in these databases. If the correct structure is not present in a database it cannot be identified, and if it cannot be identified it cannot be included in a database. Thus, there is an urgent need to augment metabolomics databases with rationally designed biochemical structures using alternative means. In this study, we present a database of in silico enzymatically synthesized metabolites (IIMDB) to partially address this problem. The database, which is available from http://metabolomics.pharm.uconn.edu/iimdb/, includes ~23,000 known compounds (mammalian metabolites, drugs, secondary plant metabolites and glycerophospholipids) collected from existing biochemical databases plus more than 400,000 computationally generated human phase I and phase II metabolites of these known compounds. IIMDB features a user-friendly web interface and a programmer-friendly RESTful web service. Ninety-five percent of the computationally generated metabolites in IIMDB were not found in any existing database. However, 21,640 were identical to compounds already listed in PubChem, HMDB, KEGG or HumanCyc. Furthermore, a vast majority of these in silico metabolites were scored as biological using BioSM, a software program that identifies biochemical structures in chemical structure space. These results suggest that in silico biochemical synthesis represents a viable approach for significantly augmenting biochemical databases for non-targeted metabolomics applications.
Metabolic pathways are composed of a series of chemical reactions occurring within a cell. In each pathway, enzymes catalyze the conversion of substrates into structurally similar products. Thus, structural similarity provides a potential means for mapping newly identified biochemical compounds to known metabolic pathways. In this paper, we present TrackSM, a cheminformatics tool designed to associate a chemical compound with a known metabolic pathway based on molecular structure matching techniques. Validation experiments show that TrackSM is capable of associating 93% of tested structures with their correct KEGG pathway class and 88% with their correct individual KEGG pathway. This suggests that TrackSM may be a valuable tool to aid in associating previously unknown small molecules with known biochemical pathways and to improve our ability to link metabolomic, proteomic, and genomic data sets. TrackSM is freely available at http://metabolomics.pharm.uconn.edu/?q=Software.html.
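The idea of mapping a compound to a pathway through structural similarity can be sketched as a nearest-neighbor lookup. This is only an illustration of the general principle, not TrackSM's algorithm: it assumes compounds are reduced to hypothetical feature sets and compared by Tanimoto similarity, and the pathway labels and members below are invented toy data.

```python
from typing import Dict, FrozenSet, List, Tuple

Features = FrozenSet[str]  # hypothetical stand-in for a molecular structure


def tanimoto(a: Features, b: Features) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_pathways(query: Features,
                  pathways: Dict[str, List[Features]]) -> List[Tuple[str, float]]:
    """Score each pathway by its most similar member; best pathway first."""
    scored = [
        (name, max((tanimoto(query, m) for m in members), default=0.0))
        for name, members in pathways.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy pathways with invented members (not taken from KEGG).
toy_pathways = {
    "sugar-phosphate-like": [
        frozenset({"hydroxyl", "aldehyde", "phosphate"}),
        frozenset({"hydroxyl", "ketone", "phosphate"}),
    ],
    "amino-acid-like": [
        frozenset({"carboxyl", "amine", "methyl"}),
    ],
}

query = frozenset({"hydroxyl", "phosphate", "carboxyl"})
ranking = rank_pathways(query, toy_pathways)
print(ranking[0][0])  # name of the best-matching toy pathway
```

Scoring a pathway by its single most similar member reflects the premise that substrates and products within a pathway resemble one another. Averaging over members would be an equally plausible design choice under these assumptions.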
The identification of compounds in complex mixtures remains challenging despite recent advances in analytical techniques. At present, no single method can detect and quantify the vast array of compounds that might be of potential interest in metabolomics studies. High performance liquid chromatography/mass spectrometry (HPLC/MS) is often considered the analytical method of choice for analysis of biofluids. The positive identification of an unknown involves matching at least two orthogonal HPLC/MS measurements (exact mass, retention index, drift time, etc.) against an authentic standard. However, due to the limited availability of authentic standards, an alternative approach involves matching known and measured features of the unknown compound with computationally predicted features for a set of candidate compounds downloaded from a chemical database. Computationally predicted features include retention index, ECOM50 (energy required to decompose 50% of a selected precursor ion in a collision-induced dissociation cell), drift time, whether the unknown compound is biological or synthetic, and a collision-induced dissociation (CID) spectrum. Computational predictions are used to filter the initial “bin” of candidate compounds. The final output is a ranked list of candidates that best match the known and measured features. In this mini review, we discuss cheminformatics methods underlying this database search-filter identification approach.
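The search-filter workflow outlined in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reviewed methods themselves: the `Candidate` fields, the 10 ppm mass tolerance, the retention index window of 50, and the toy candidates are all hypothetical, and only three of the listed features (exact mass, retention index, biological/synthetic flag) are used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    mono_mass: float       # monoisotopic mass in Da
    predicted_ri: float    # computationally predicted retention index
    is_biochemical: bool   # output of some biological/synthetic classifier


def ppm_error(measured: float, theoretical: float) -> float:
    """Absolute mass error in parts per million."""
    return abs(measured - theoretical) / theoretical * 1e6


def search_filter(candidates: List[Candidate], measured_mass: float,
                  measured_ri: float, mass_tol_ppm: float = 10.0,
                  ri_window: float = 50.0) -> List[Candidate]:
    """Bin by exact mass, filter by predicted features, then rank."""
    # Step 1: initial "bin" of candidates within the mass tolerance.
    binned = [c for c in candidates
              if ppm_error(measured_mass, c.mono_mass) <= mass_tol_ppm]
    # Step 2: filter the bin using the predicted features.
    kept = [c for c in binned
            if c.is_biochemical
            and abs(c.predicted_ri - measured_ri) <= ri_window]
    # Step 3: rank by closeness of predicted to measured retention index.
    return sorted(kept, key=lambda c: abs(c.predicted_ri - measured_ri))


# Toy candidates with invented values.
toy_candidates = [
    Candidate("cand_1", 180.0634, 410.0, True),
    Candidate("cand_2", 180.0634, 450.0, True),
    Candidate("cand_3", 180.0634, 405.0, False),
    Candidate("cand_4", 181.0700, 400.0, True),
]

ranked = search_filter(toy_candidates, measured_mass=180.0640, measured_ri=400.0)
print([c.name for c in ranked])
```

Here `cand_4` falls outside the mass bin and `cand_3` is removed as non-biochemical, leaving the two remaining candidates ordered by retention index deviation. Additional predicted features such as ECOM50, drift time, or CID spectrum similarity could be added as further filter or ranking terms in the same pattern.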
Metabolomics is the study of small molecules, called metabolites, of a cell, tissue or organism. It is of particular interest as endogenous metabolites represent the phenotype resulting from gene expression. A major challenge in metabolomics research is the structural identification of unknown biochemical compounds in complex biofluids. In this paper we present an efficient cheminformatics tool, BioSMXpress, that uses known endogenous mammalian biochemicals and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSMXpress identifies endogenous mammalian biochemical structures with high accuracy. BioSMXpress is 8 times faster than our previous tool, BioSM, without compromising the accuracy of its predictions. BioSMXpress is freely available at http://engr.uconn.edu/~rajasek/BioSMXpress.zip