Large-scale identification of metabolites is key to elucidating and modeling metabolism at the systems level. Advances in metabolomics technologies, particularly ultra-high resolution mass spectrometry (MS), enable comprehensive and rapid analysis of metabolites. However, a significant barrier to meaningful data interpretation is the identification of a wide range of metabolites, including unknowns, and the determination of their role(s) in various metabolic networks. Chemoselective (CS) probes that tag metabolite functional groups, combined with high mass accuracy, provide additional structural constraints for metabolite identification and quantification. We have developed a novel algorithm, Chemically Aware Substructure Search (CASS), that efficiently detects functional groups within existing metabolite databases, allowing for combined molecular formula and functional group (from CS tagging) queries to aid in metabolite identification without a priori knowledge. Analysis of the isomeric compounds in both the Human Metabolome Database (HMDB) and KEGG Ligand demonstrated a high percentage of isomeric molecular formulae (43 and 28%, respectively), indicating the necessity for techniques such as CS-tagging. Furthermore, these two databases have only moderate overlap in molecular formulae. Thus, it is prudent to use multiple databases in metabolite assignment, since each major metabolite database represents different portions of metabolism within the biosphere. In silico analysis of various CS-tagging strategies under different conditions for adduct formation demonstrates that combined FT-MS derived molecular formulae and CS-tagging can uniquely identify up to 71% of KEGG and 37% of the combined KEGG/HMDB database vs. 41 and 17%, respectively, without adduct formation. This difference in isomer disambiguation between the databases highlights the strength of CS-tagging for non-lipid metabolite identification. However, unique identification of complex lipids still needs additional information.
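The idea of combining a molecular formula with functional-group information to separate isomers can be illustrated with a minimal sketch. It does not reproduce CASS or any database query. The records, the group names, and the helper functions below are hypothetical, chosen only to show how a formula-plus-tag key can raise the fraction of uniquely identified compounds.

```python
from collections import defaultdict

# Hypothetical toy records: (name, molecular formula, functional-group counts).
# These are illustrative entries, not an export from HMDB or KEGG.
COMPOUNDS = [
    ("glucose", "C6H12O6", {"aldehyde_or_ketone": 1, "carboxyl": 0}),
    ("fructose", "C6H12O6", {"aldehyde_or_ketone": 1, "carboxyl": 0}),
    ("lactic acid", "C3H6O3", {"aldehyde_or_ketone": 0, "carboxyl": 1}),
    ("glyceraldehyde", "C3H6O3", {"aldehyde_or_ketone": 1, "carboxyl": 0}),
    ("alanine", "C3H7NO2", {"aldehyde_or_ketone": 0, "carboxyl": 1}),
]


def formula_key(record):
    """Group compounds by molecular formula only."""
    return record[1]


def formula_and_tag_key(record):
    """Group compounds by formula plus the (assumed) tagged group counts."""
    return (record[1], tuple(sorted(record[2].items())))


def unique_fraction(compounds, key):
    """Fraction of compounds whose key is shared with no other compound."""
    groups = defaultdict(list)
    for record in compounds:
        groups[key(record)].append(record[0])
    unique = sum(1 for names in groups.values() if len(names) == 1)
    return unique / len(compounds)


by_formula = unique_fraction(COMPOUNDS, formula_key)
by_formula_and_tag = unique_fraction(COMPOUNDS, formula_and_tag_key)
```

In this toy set the tag separates lactic acid from glyceraldehyde, while glucose and fructose stay ambiguous because they share both the formula and the assumed group counts.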
Introduction: Direct injection Fourier-transform mass spectrometry (FT-MS) allows for the high-throughput and high-resolution detection of thousands of metabolite-associated isotopologues. However, spectral artifacts can generate large numbers of spectral features (peaks) that do not correspond to known compounds. Misassignment of these artifactual features creates interpretive errors and limits our ability to discern the role of representative features within living systems. Objectives: Our goal is to develop rigorous methods that identify and handle spectral artifacts within the context of high-throughput FT-MS-based metabolomics studies. Results: We observed three types of artifacts unique to FT-MS that we named high peak density (HPD) sites: fuzzy sites, ringing and partial ringing. While ringing artifacts are well-known, fuzzy sites and partial ringing have not been previously well-characterized in the literature. We developed new computational methods based on comparisons of peak density within a spectrum to identify regions of spectra with fuzzy sites. We used these methods to identify and eliminate fuzzy site artifacts in an example dataset of paired cancer and non-cancer lung tissue samples and evaluated the impact of these artifacts on classification accuracy and robustness. Conclusion: Our methods robustly identified consistent fuzzy site artifacts in our FT-MS metabolomics spectral data. Without artifact identification and removal, 91.4% classification accuracy was achieved on an example lung cancer dataset; however, these classifiers rely heavily on artifactual features present in fuzzy sites. Proper removal of fuzzy site artifacts produces a more robust classifier based on non-artifactual features, with slightly improved accuracy of 92.4% in our example analysis. Electronic supplementary material: The online version of this article (10.1007/s11306-018-1426-9) contains supplementary material, which is available to authorized users.
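A comparison of peak density within a spectrum can be sketched as a sliding-window count over peak m/z values. This is a minimal sketch, not the authors' published method. The window width, step, threshold factor, and the function name `high_density_windows` are all assumptions made for illustration.

```python
from bisect import bisect_left
from statistics import median


def high_density_windows(mz_values, width=1.0, step=0.5, factor=5.0):
    """Return (start, end, count) for windows with unusually many peaks.

    A window is flagged when its peak count exceeds `factor` times the
    median count over all non-empty windows. All thresholds are assumed.
    """
    mzs = sorted(mz_values)
    if not mzs:
        return []
    windows = []
    index = 0
    while True:
        start = mzs[0] + index * step
        if start > mzs[-1]:
            break
        end = start + width
        # Count peaks in the half-open interval [start, end).
        count = bisect_left(mzs, end) - bisect_left(mzs, start)
        windows.append((start, end, count))
        index += 1
    baseline = median(c for _, _, c in windows if c > 0)
    return [w for w in windows if w[2] > factor * baseline]


# Synthetic example: one peak per integer m/z plus a dense cluster near 110.
sparse_peaks = [float(mz) for mz in range(100, 121)]
dense_cluster = [110.0 + 0.01 * k for k in range(1, 31)]
flagged = high_density_windows(sparse_peaks + dense_cluster)
```

Using the median of non-empty windows as the baseline keeps a single dense region from inflating its own threshold, which is one plausible design choice among several.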
Improvements in Fourier transform mass spectrometry (FT-MS) enable increasingly complex experiments in the field of metabolomics. What is directly detected in FT-MS spectra are spectral features (peaks) that correspond to sets of adducted and charged forms of specific molecules in the sample. The robust assignment of these features is an essential step for MS-based metabolomics experiments, but the sheer complexity of what is detected and a variety of analytically introduced variance, errors, and artifacts have hindered the systematic analysis of complex patterns of observed peaks with respect to isotope content. We have developed a method called SMIRFE that detects small biomolecules and determines their elemental molecular formula (EMF) using detected sets of isotopologue peaks sharing the same EMF. SMIRFE does not use a database of known metabolite formulas; instead, a nearly comprehensive search space of all isotopologues within a mass range is constructed and used for assignment. This search space can be tailored for different isotope labeling patterns expected in different stable isotope tracing experiments. Using consumer-level computing equipment, a large search space of 2000 Da was constructed, and assignment performance was evaluated and validated using verified assignments on a pair of peak lists derived from spectra containing unlabeled and 15N-labeled versions of amino acids derivatized using ethylchloroformate. SMIRFE identified 18 of 18 predicted derivatized EMFs, and each assignment was evaluated statistically and assigned an e-value representing the probability to occur by chance.
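Database-free formula assignment can be illustrated, in a greatly simplified form, by brute-force enumeration of elemental formulas whose monoisotopic mass falls within a tolerance of an observed mass. This sketch is not SMIRFE: it ignores isotopologue sets, adducts, charge, chemical validity rules, and e-values. The element limits, the ppm tolerance, and the function names are assumptions, and the masses are approximate monoisotopic values.

```python
from itertools import product

# Approximate monoisotopic masses in Da (assumed adequate for a sketch).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}


def formula_mass(counts):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in counts.items())


def candidate_formulas(target_mass, ppm=1.0, max_counts=None):
    """Enumerate CHNO formulas within `ppm` of `target_mass`.

    `max_counts` bounds the search space; the defaults are arbitrary.
    """
    limits = max_counts or {"C": 10, "H": 22, "N": 4, "O": 6}
    tolerance = target_mass * ppm * 1e-6
    elements = list(limits)
    hits = []
    for combo in product(*(range(limits[el] + 1) for el in elements)):
        counts = dict(zip(elements, combo))
        if abs(formula_mass(counts) - target_mass) <= tolerance:
            hits.append(counts)
    return hits


# Usage: recover neutral glycine (C2H5NO2) from its own computed mass.
glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
hits = candidate_formulas(formula_mass(glycine))
```

A search space built this way grows quickly with the mass range and element set, so a practical method would need pruning and precomputation that this sketch does not attempt.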
Despite instrument and algorithmic improvements, the untargeted and accurate assignment of metabolites remains an unsolved problem in metabolomics. New assignment methods such as our SMIRFE algorithm can assign elemental molecular formulas to observed spectral features in a highly untargeted manner without orthogonal information from tandem MS or chromatography. However, for many lipidomics applications, it is necessary to know at least the lipid category or class that is associated with a detected spectral feature to derive a biochemical interpretation. Our goal is to develop a method for robustly classifying elemental molecular formula assignments into lipid categories for an application to SMIRFE-generated assignments. Using a Random Forest machine learning approach, we developed a method that can predict lipid category and class from SMIRFE non-adducted molecular formula assignments. Our methods achieve high average predictive accuracy (>90%) and precision (>83%) across all eight of the lipid categories in the LIPIDMAPS database. Classification performance was evaluated using sets of theoretical, data-derived, and artifactual molecular formulas. Our methods enable the lipid classification of non-adducted molecular formula assignments generated by SMIRFE without orthogonal information, facilitating the biochemical interpretation of untargeted lipidomics experiments. This lipid classification appears insufficient for validating single-spectrum assignments, but could be useful in cross-spectrum assignment validation.
Lung cancer remains the leading cause of cancer death worldwide, and non-small cell lung carcinoma (NSCLC) represents 85% of newly diagnosed lung cancers. In this study, we utilized our untargeted assignment tool Small Molecule Isotope Resolved Formula Enumerator (SMIRFE) and ultra-high-resolution Fourier transform mass spectrometry to examine lipid profile differences between paired cancerous and non-cancerous lung tissue samples from 86 patients with suspected stage I or IIA primary NSCLC. Correlation and co-occurrence analysis revealed significant lipid profile differences between cancer and non-cancer samples. Further analysis of machine-learned lipid categories for the differentially abundant molecular formulas identified a high-abundance sterol, high-abundance and high-m/z sphingolipid, and low-abundance glycerophospholipid metabolic phenotype across the NSCLC samples. At the class level, higher abundances of sterol esters and lower abundances of cardiolipins were observed, suggesting altered stearoyl-CoA desaturase 1 (SCD1) or acetyl-CoA acetyltransferase (ACAT1) activity and altered human cardiolipin synthase 1 or lysocardiolipin acyltransferase activity, respectively, the latter of which is known to confer apoptotic resistance. The presence of a shared metabolic phenotype across a variety of genetically distinct NSCLC subtypes suggests that this phenotype is necessary for NSCLC development and may result from multiple distinct genetic lesions. Thus, targeting the shared affected pathways may be beneficial for a variety of genetically distinct NSCLC subtypes.
Metabolic flux analysis requires both a reliable metabolic model and reliable metabolic profiles in characterizing metabolic reprogramming. Advances in analytic methodologies enable production of high-quality metabolomics datasets capturing isotopic flux. However, useful metabolic models can be difficult to derive due to the lack of relatively complete atom-resolved metabolic networks for a variety of organisms, including humans. Here, we developed a neighborhood-specific graph coloring method that creates unique identifiers for each atom in a compound, facilitating construction of an atom-resolved metabolic network. What is more, this method is guaranteed to generate the same identifier for symmetric atoms, enabling automatic identification of possible additional mappings caused by molecular symmetry. Furthermore, a compound coloring identifier derived from the corresponding atom coloring identifiers can be used for compound harmonization across various metabolic network databases, which is an essential first step in network integration. With the compound coloring identifiers, 8865 correspondences between KEGG (Kyoto Encyclopedia of Genes and Genomes) and MetaCyc compounds are detected, with 5451 of them confirmed by other identifiers provided by the two databases. In addition, we found that the Enzyme Commission (EC) numbers of reactions can be used to validate possible correspondence pairs, with 1848 unconfirmed pairs validated by commonality in reaction ECs. Moreover, we were able to detect various issues and errors with compound representation in the KEGG and MetaCyc databases by compound coloring identifiers, demonstrating the usefulness of this methodology for database curation.
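The general idea of coloring atoms by their neighborhoods can be sketched with a generic iterative color refinement in the style of Weisfeiler-Lehman. This is a swapped-in textbook technique, not the neighborhood-specific method described above, and unlike that method it can in some graphs assign one color to atoms that are not truly symmetric. Bond orders, hydrogens, and stereochemistry are ignored, and the function names and hash truncation are assumptions.

```python
import hashlib


def _digest(text):
    """Short, deterministic label for a neighborhood signature."""
    return hashlib.sha1(text.encode()).hexdigest()[:12]


def atom_identifiers(elements, bonds, rounds=None):
    """Color atoms by iteratively hashing each atom with its neighbors.

    elements: list of element symbols, one per atom.
    bonds: list of (i, j) atom index pairs (bond order is ignored here).
    """
    neighbors = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    colors = list(elements)
    for _ in range(rounds or len(elements)):
        colors = [
            _digest(color + "|" + ",".join(sorted(colors[j] for j in neighbors[i])))
            for i, color in enumerate(colors)
        ]
    return colors


def compound_identifier(atom_ids):
    """Order-independent compound label built from the atom colors."""
    return _digest(",".join(sorted(atom_ids)))


# Propane heavy atoms: the two terminal carbons are symmetric.
propane = atom_identifiers(["C", "C", "C"], [(0, 1), (1, 2)])

# The same C-C-O skeleton entered with two different atom orderings.
skeleton_a = compound_identifier(atom_identifiers(["C", "C", "O"], [(0, 1), (1, 2)]))
skeleton_b = compound_identifier(atom_identifiers(["O", "C", "C"], [(0, 1), (1, 2)]))
```

Because the compound label is built from the sorted atom colors, it does not depend on how a database happens to number the atoms, which is the property that makes such labels usable for cross-database matching.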
Large‐scale identification of metabolites is key to understanding metabolism at the systems level. Advances in metabolomics technologies, particularly ultra‐high resolution mass spectrometry, enable rapid, comprehensive analysis of metabolites that is impractical to achieve by conventional methods. A significant barrier to meaningful data interpretation is the identification of metabolites, including unknowns, and the determination of their role(s) in metabolic networks. Chemoselective (CS) probes that tag metabolite functional groups, combined with high mass accuracy, provide additional structural constraints for metabolite identification and quantification. We have developed a novel algorithm that efficiently detects functional groups within existing metabolite databases such as KEGG Ligand and the Human Metabolome Database (HMDB), allowing for combined molecular formula and functional group queries to aid in metabolite identification without a priori knowledge. Analysis of the isomeric compounds in both HMDB and KEGG demonstrated a high percentage of isomeric molecular formulae (43% and 28%, respectively), indicating the necessity for techniques such as CS‐tagging. Furthermore, these databases have only moderate overlap in molecular formulae. Thus, it is prudent to use multiple databases in metabolite assignment, since each major metabolite database represents different portions of metabolism. In silico analysis of CS‐tagging strategies demonstrates that combined FT‐MS derived molecular formulae and CS‐tagging can uniquely identify up to 71% of KEGG and 37% of the combined KEGG/HMDB database compared with 41% and 17%, respectively, without adduct formation.
Introduction: Although Fourier-transform mass spectrometry has substantially improved our ability to detect lipids and other metabolites, the untargeted and accurate assignment of detected metabolites remains an unsolved problem in metabolomics. New assignment methods such as our SMIRFE algorithm can assign elemental molecular formulas to observed spectral features in an untargeted manner without orthogonal information from tandem MS or chromatography. However, for many lipidomics applications, it is necessary to know at least the lipid category or class that is associated with a detected spectral feature in order to derive a biochemical interpretation. Objectives: Our goal is to develop a method for robustly classifying elemental molecular formula assignments into lipid categories for application to SMIRFE-generated assignments. Results: Using machine learning, we developed a method that can predict lipid category and class from SMIRFE molecular formula assignments. Our methods achieve high accuracy (>90%) and precision (>83%) for all eight of the lipid categories in the LIPIDMAPS database. Model performance was evaluated using sets of theoretical, data-derived, and artifactual molecular formulas. Our models were generalizable, applicable to real-world datasets, and very discriminating, with most molecular formulas classified to the "not lipid" category. Lipid categories with the highest classification propensities were glycerophospholipids and sphingolipids, matching the highest category prevalence in LIPIDMAPS. Conclusions: Our methods enable the lipid classification of untargeted molecular formula assignments generated by SMIRFE without orthogonal information, facilitating biochemical interpretation of highly untargeted lipidomics experiments. This lipid classification appears insufficient for validating single-spectrum assignments, but could be useful in cross-spectrum assignment validation.
Author Contributions: JMM designed and implemented the machine learning models and the convex hull analysis. HNBM designed the m/z shifted formula analysis and JMM implemented it. The primary manuscript writers were JMM and HNBM. Abstract: These pages contain supporting information including descriptions of the tissue samples from which the experimental set of formulas was derived, additional result tables for the machine learning models, and a figure showing the distribution of molecular formulas across our training datasets with respect to m/z.