Despite recent advances in NMR and mass spectrometry, the structural identification of organic compounds in complex biofluids remains a significant analytical challenge. For mass spectrometry applications, chemical identification is generally limited to determination of elemental formula. Here we test the hypothesis that unknown chemical structures can be determined by matching their experimental collision-induced dissociation (CID) fragmentation spectra with computational fragmentation spectra of compounds retrieved from chemical databases. The monoisotopic molecular weights (MIMW ± 10 ppm) of 102 "test" compounds were used to download 102 "bins" from the PubChem database. Each bin contained the corresponding test compound and, on average, 272 other candidate compounds, including 158 compounds having the same elemental formula as the test compound. Commercially available software was used to generate fragmentation spectra for all compounds in each of the 102 bins. Experimental CID spectra for each of the 102 test compounds were then compared to the computational spectra in order to rank candidate compounds based on the number of fragment MIMW matches. This method returned the test compound as the highest ranking (or tied with the highest ranking) compound for 65 of the 102 bins. The test compound was ranked within the top 20 candidate compounds for 87 bins. In addition, the correct elemental formula was ranked first for 98 of 102 bins. Thus, matching experimental with computational fragmentation spectra is a valid method for rapidly discriminating among compounds having the same elemental formula and provides a novel approach for querying chemical databases for structural information.
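The ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' software: the function names, the 10 ppm fragment tolerance, and the tie-handling rule (tied candidates share the best rank) are assumptions made for the example.

```python
def ppm_error(observed, theoretical):
    """Absolute mass error in parts per million."""
    return abs(observed - theoretical) / theoretical * 1e6


def count_fragment_matches(experimental, predicted, tol_ppm=10.0):
    """Count experimental fragment masses matched by at least one predicted mass.

    tol_ppm is an assumed tolerance; the abstract does not state the value
    used for fragment matching.
    """
    return sum(
        any(ppm_error(obs, pred) <= tol_ppm for pred in predicted)
        for obs in experimental
    )


def rank_candidates(experimental, candidates, tol_ppm=10.0):
    """Rank candidates by number of fragment matches, best first.

    candidates maps a (hypothetical) compound name to its list of
    computed fragment masses. Returns (name, matches, rank) tuples;
    candidates with equal match counts share the same rank.
    """
    scores = {
        name: count_fragment_matches(experimental, fragments, tol_ppm)
        for name, fragments in candidates.items()
    }
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    return [
        (name, score, 1 + sum(1 for s in scores.values() if s > score))
        for name, score in ordered
    ]
```

Under this sketch, a test compound counts as "highest ranking (or tied with the highest ranking)" whenever its rank is 1.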
Survival yield analysis is routinely used in mass spectrometry as a tool for assessing precursor ion stability and internal energy. Because ion internal energy and decomposition reaction rates are dependent on chemical structure, we reasoned that survival yield curves should be compound-specific and therefore useful for chemical identification. In this study, a quantitative approach for analyzing the correlation between survival yield and collision energy was developed and validated. This method is based on determining the collision energy (CE) at which the survival yield is 50% (CE50) and, further, provides slope and intercept values for each survival yield curve. In initial experiments using a defined set of homologous compounds, we found that CE50 values were easily determined, quantitative, highly reproducible, and could discriminate between structural and even positional isomers. Further analysis demonstrated that CE50 values were independent of cone potential and orthogonal to compound mass. Experimentally determined CE50 values for a diverse set of 54 compounds were correlated to Molconn molecular structure descriptors. The resulting model yielded a statistically significant linear correlation between experimental and calculated CE50 values and identified several structural characteristics related to precursor ion stability and fragmentation mechanism. Thus, the CE50 is a promising method for compound identification and discrimination.
Survival yield analysis was initially developed as a tool to quantify the distribution of precursor ion internal energies to explain fragmentation patterns that occur using mass spectrometry [1]. Survival yield has since been used as a method to correlate conditions in the mass spectrometer to the energetics of sample ions. These studies have developed a wide array of equations for understanding molecular decomposition in a mass spectrometer.
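One way to extract a CE50 value together with a slope and intercept from a survival yield curve is sketched below. The abstract does not say how the curve was fitted, so the straight-line fit over the transition region (survival yield between 0.1 and 0.9) is an assumption, as are all function names. Survival yield is computed with the commonly used ratio of precursor intensity to total ion intensity.

```python
def survival_yield(precursor_intensity, fragment_intensities):
    """Fraction of precursor ions surviving: I_p / (I_p + sum of fragment intensities)."""
    total = precursor_intensity + sum(fragment_intensities)
    return precursor_intensity / total


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ce50(collision_energies, survival_yields, low=0.1, high=0.9):
    """Estimate the collision energy at 50% survival yield.

    Assumption: only points with low <= yield <= high are fitted, so the
    flat plateaus at either end of the curve do not distort the line.
    Returns (ce50, slope, intercept).
    """
    points = [
        (ce, sy)
        for ce, sy in zip(collision_energies, survival_yields)
        if low <= sy <= high
    ]
    if len(points) < 2:
        raise ValueError("need at least two points in the transition region")
    xs, ys = zip(*points)
    slope, intercept = linear_fit(xs, ys)
    return (0.5 - intercept) / slope, slope, intercept
```

A sigmoidal fit would follow the curve shape more closely; the linear version is shown only because it yields the slope and intercept directly.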
The quasi-equilibrium theory of a unimolecular reaction indicates that the rate of molecular decomposition, as occurs in collision-induced dissociation (CID), is dependent on the molecule's internal energy ($E$), activation energy ($E_0$), number of vibrational degrees of freedom ($n$), and the entropy of the reaction transition state ($\Delta S^{*}$). $E_0$, $n$, and $\Delta S^{*}$ are dependent on the structure of the molecule [2], whereas $E$ is a function of the kinetic energy applied to the molecule in the collision cell. The fraction of a precursor molecule that survives a CID reaction (survival yield) depends on the reaction rate and the reaction time in the collision cell. In CID, transferring a portion of the kinetic energy of the accelerated precursor ion to internal energy by collisions with relatively stationary gas atoms increases the internal energy of a sample ion. The maximum energy available for absorption ($E_{com}$) by the precursor ion in the collision process is described by eq 1 [3]:
$$E_{com} = E_{kin}\,\frac{m_G}{m_G + m_i} \qquad (1)$$
where $E_{com}$ is the center of mass kinetic energy, $m_G$ is the mass of the collision gas, $E_{kin}$ is the kinetic energy of the sample ion, and $m_i$ is the mass of t...
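The center of mass energy relation referred to above as eq 1 can be evaluated with a short helper. This is a sketch using the standard two-body collision expression, in which the laboratory-frame kinetic energy is scaled by the collision gas mass over the combined mass; the function name and the numbers in the comments are illustrative assumptions.

```python
def center_of_mass_energy(e_kin, m_gas, m_ion):
    """Maximum energy available for conversion to internal energy in one collision.

    e_kin : laboratory-frame kinetic energy of the precursor ion (e.g. in eV)
    m_gas : mass of the collision gas atom
    m_ion : mass of the precursor ion (same mass units as m_gas)
    """
    if m_gas <= 0 or m_ion <= 0:
        raise ValueError("masses must be positive")
    return e_kin * m_gas / (m_gas + m_ion)


# Illustrative values only: a 20 eV ion of mass 360 colliding with a gas
# atom of mass 40 has 20 * 40 / 400 = 2 eV available in the center of mass frame.
example = center_of_mass_energy(20.0, 40.0, 360.0)
```

Because the available energy falls as the ion mass rises, expressing collision energies in the center of mass frame makes values more comparable across compounds of different mass.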
The goal of many metabolomic studies is to identify the molecular structure of endogenous molecules that are differentially expressed among sample or treatment groups. The identified compounds can then be used to gain an understanding of disease mechanisms. Unfortunately, despite recent advances in a variety of analytical techniques, small molecule (<1000 Da) identification remains difficult. Rarely can a chemical structure be determined from experimental “features” such as retention time, exact mass, and collision induced dissociation spectra. Thus, without knowing structure, biological significance remains obscure. In this study we explore an identification method in which the measured exact mass of an unknown is used to query available chemical databases to compile a list of candidate compounds. Predictions are made for the candidates using models of experimental features that have been measured for the unknown. The predicted values are used to filter the candidate list by eliminating compounds with predicted values substantially different from the unknown. The intent is to reduce the list of candidates to a reasonable number that can be obtained and measured for confirmation. To facilitate this exploration, we measured data and created models for two experimental features: MS Ecom50 (the energy in eV required to fragment 50% of a selected precursor ion) and HPLC retention index. Using a dataset of 52 compounds, Ecom50 models were developed based on both Molconn and CODESSA structural descriptors. These models gave r² values of 0.89 to 0.94 depending on the number of inputs, the modeling algorithm chosen, and whether neutral or protonated structures were used. The retention index model was developed with 400 compounds using a back propagation artificial neural network and 33 Molconn structure descriptors. External validation gave a v² = 0.86 and standard error of 38 retention index units.
As a test of the validity of the filtering approach, the Ecom50 and retention index models, along with exact mass and collision induced dissociation spectra matching, were used to identify 1,3-dicyclohexylurea in human plasma. This compound was not previously known to exist in human biofluids and its elemental formula was identical to that of 315 other candidate compounds downloaded from PubChem. These results suggest that the use of Ecom50 and retention index predictive models can improve non-targeted metabolite structure identification using HPLC/MS derived structural features.
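The candidate-filtering idea described above can be sketched as a simple window filter. The abstract does not define "substantially different", so the acceptance windows here are caller-supplied assumptions (for example, some multiple of a model's standard error), and the `Candidate` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A database candidate with model-predicted features (hypothetical layout)."""
    name: str
    predicted_ecom50: float  # eV
    predicted_ri: float      # retention index units


def filter_candidates(candidates, measured_ecom50, measured_ri,
                      ecom50_window, ri_window):
    """Keep candidates whose predictions fall within both windows of the measurement.

    ecom50_window and ri_window are assumed tolerances chosen by the user;
    a candidate is eliminated if either prediction lies outside its window.
    """
    return [
        c for c in candidates
        if abs(c.predicted_ecom50 - measured_ecom50) <= ecom50_window
        and abs(c.predicted_ri - measured_ri) <= ri_window
    ]
```

Wider windows retain more false candidates but lower the risk of discarding the true compound when a model prediction is poor.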
A back-propagation artificial neural network (ANN) was used to create a 10-fold leave-10%-out cross-validated ensemble model of high performance liquid chromatography retention index (HPLC-RI) for a data set of 498 diverse druglike compounds. A 10-fold multiple linear regression (MLR) ensemble model of the same data was developed for comparison. Molecular structure was described using IGroup E-state indices, a novel set of structure-information representation (SIR) descriptors, along with molecular connectivity chi and kappa indices and other SIR descriptors previously reported. The same input descriptors were used to develop models by both learning algorithms. The MLR model yielded marginally acceptable statistics with training correlation r² = 0.65, mean absolute error (MAE) = 83 RI units. External validation of 104 compounds not used for model development yielded validation v² = 0.49 and MAE = 73 RI units. The distribution of residuals for the fit and validation data sets suggests a nonlinear relationship between retention index and molecular structure as described by the SIR indices. Not surprisingly, the ANN model was significantly more accurate for both training and validation with training set r² = 0.93, MAE = 30 RI units and validation v² = 0.84, MAE = 41 RI units. For the ANN model, a total of 91% of validation predictions were within 100 RI units of the experimental value.
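The summary statistics quoted above (MAE and the share of predictions within 100 RI units) can be computed as sketched below. Combining the ten fold models by taking the mean of their predictions is an assumption made for this example; the abstract does not state how the ensemble output was formed, and the function names are illustrative.

```python
def ensemble_predict(member_predictions):
    """Average predictions across ensemble members.

    member_predictions is a list with one list of predictions per member,
    all in the same compound order. Simple averaging is assumed here.
    """
    return [sum(values) / len(values) for values in zip(*member_predictions)]


def mean_absolute_error(observed, predicted):
    """Mean of |observed - predicted| over all compounds."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def fraction_within(observed, predicted, tolerance=100.0):
    """Fraction of predictions within `tolerance` RI units of the observed value."""
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= tolerance)
    return hits / len(observed)
```

For example, two members predicting `[500, 700]` and `[520, 660]` give an ensemble prediction of `[510, 680]`, which can then be scored against the experimental retention indices.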