y-Randomization is a tool used in the validation of QSPR/QSAR models, whereby the performance of the original model in data description (r²) is compared to that of models built for a permuted (randomly shuffled) response, based on the original descriptor pool and the original model building procedure. We compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection. For each combination of number of observations (compounds), number of descriptors in the final model, and number of descriptors in the pool to select from, computer experiments using the same descriptor selection method result in two different mean highest random r² values. A lower one is produced by y-randomization or a variant likewise based on the original descriptors, while a higher one is obtained from variants that use random number pseudodescriptors. The difference is due to the intercorrelation of real descriptors in the pool. We propose to compare an original model's r² to both of these whenever possible. The meaning of the three possible outcomes of such a double test is discussed. Often y-randomization is not available to a potential user of a model, because the values of all descriptors in the pool for all compounds have not been published. In such cases random number experiments as proposed here are still possible. The test was applied to several recently published MLR QSAR equations, and cases of failure were identified. Some progress is also reported toward the aim of obtaining the mean highest r² of random pseudomodels by calculation rather than by tedious multiple simulations on random number variables.
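The double test described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: greedy forward selection stands in for the unspecified descriptor selection method, the data are synthetic, and all function and variable names are hypothetical rather than taken from the paper.

```python
import numpy as np

def r2_of_fit(X, y):
    """r2 of an ordinary least-squares fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

def forward_select_r2(pool, y, k):
    """Greedily pick k descriptors from the pool (an assumed selection
    method) and return the r2 of the final k-descriptor model."""
    chosen, r2 = [], 0.0
    for _ in range(k):
        candidates = [j for j in range(pool.shape[1]) if j not in chosen]
        scores = [r2_of_fit(pool[:, chosen + [j]], y) for j in candidates]
        best = int(np.argmax(scores))
        chosen.append(candidates[best])
        r2 = scores[best]
    return r2

def mean_random_r2(pool, y, k, n_trials, rng, mode):
    """Mean highest random r2 over n_trials pseudomodels.
    mode='permute_y': y-randomization on the original descriptors.
    mode='random_x' : original response, random-number pseudodescriptors."""
    scores = []
    for _ in range(n_trials):
        if mode == "permute_y":
            scores.append(forward_select_r2(pool, rng.permutation(y), k))
        elif mode == "random_x":
            fake_pool = rng.standard_normal(pool.shape)
            scores.append(forward_select_r2(fake_pool, y, k))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return float(np.mean(scores))

# Synthetic, intercorrelated descriptor pool (purely illustrative data).
rng = np.random.default_rng(0)
n_obs, pool_size, k = 30, 20, 3
latent = rng.standard_normal((n_obs, 4))
pool = latent @ rng.standard_normal((4, pool_size))
pool += 0.1 * rng.standard_normal((n_obs, pool_size))
y = pool[:, 0] - 0.5 * pool[:, 1] + 0.5 * rng.standard_normal(n_obs)

r2_original = forward_select_r2(pool, y, k)
r2_y_random = mean_random_r2(pool, y, k, 20, rng, "permute_y")
r2_x_random = mean_random_r2(pool, y, k, 20, rng, "random_x")
print(f"original r2:                   {r2_original:.3f}")
print(f"mean r2, permuted response:    {r2_y_random:.3f}")
print(f"mean r2, random descriptors:   {r2_x_random:.3f}")
```

The original model's r² is then compared against both random baselines. The random-descriptor variant needs only the response values and the three counts (observations, selected descriptors, pool size), which is why it remains possible when the full descriptor pool is unpublished.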
This perspective article provides an assessment of the state-of-the-art in the molecular-resolution analysis of complex organic materials. These materials can be divided into biomolecules in complex mixtures (which are amenable to successful separation into unambiguously defined molecular fractions) and complex nonrepetitive materials (which cannot be purified in the conventional sense because they are even more intricate). Molecular-level analyses of these complex systems critically depend on the integrated use of high-performance separation, high-resolution organic structural spectroscopy and mathematical data treatment. At present, only high-precision frequency-derived data exhibit sufficient resolution to overcome the otherwise common and detrimental effects of intrinsic averaging, which deteriorate spectral resolution to the degree of bulk-level rather than molecular-resolution analysis. High-precision frequency measurements are integral to the two most influential organic structural spectroscopic methods for the investigation of complex materials, namely NMR spectroscopy (which provides unsurpassed detail on close-range molecular order) and FTICR mass spectrometry (which provides unrivalled resolution), and they can be translated into isotope-specific molecular-resolution data of unprecedented significance and richness. The quality of this standalone de novo molecular-level resolution data is of unparalleled mechanistic relevance and is sufficient to fundamentally advance our understanding of the structures and functions of complex biomolecular mixtures and nonrepetitive complex materials, such as natural organic matter (NOM), aerosols, and soil, plant and microbial extracts, all of which are currently poorly amenable to meaningful target analysis.
The discrete analytical volumetric pixel space that is presently available to describe complex systems (defined by NMR, FT mass spectrometry and separation technologies) is in the range of 10^8 to 10^14 voxels, and is therefore capable of providing the necessary detail for a meaningful molecular-level analysis of very complex mixtures. Nonrepetitive complex materials exhibit mass spectral signatures in which the signal intensity often follows the number of chemically feasible isomers. This suggests that even the most strongly resolved FTICR mass spectra of complex materials represent simplified (e.g. isomer-filtered) projections of structural space.
The construction of complete lists of regular graphs up to isomorphism is one of the oldest problems in constructive combinatorics. In this paper an efficient algorithm to generate regular graphs with a given number of vertices and vertex degree is introduced. The method is based on orderly generation refined by criteria to avoid isomorphism checking and combined with a fast test for canonicity. The implementation makes it possible to compute even large classes of graphs, such as the 4-regular graphs on 18 vertices and, for the first time, the 5-regular graphs on 16 vertices. In cases with given girth, some remarkable results are also obtained. For instance, the 5-regular graphs with girth 5 and minimal number of vertices were generated in less than one hour. There exist exactly four (5,5)-cages.
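To make the problem statement concrete, the following sketch enumerates k-regular graphs on n vertices up to isomorphism by naive brute force. It is not the orderly generation algorithm of the paper: it tests every edge subset and computes a canonical form by trying all vertex permutations, so it is only feasible for very small n. The function name is hypothetical.

```python
from itertools import combinations, permutations

def regular_graphs(n, k):
    """Return one canonical edge tuple per isomorphism class of
    k-regular simple graphs on n labeled vertices (brute force)."""
    if (n * k) % 2:
        return []  # the degree sum must be even
    all_edges = list(combinations(range(n), 2))
    perms = list(permutations(range(n)))
    classes = set()
    for edges in combinations(all_edges, n * k // 2):
        degree = [0] * n
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        if any(d != k for d in degree):
            continue
        # Canonical form: lexicographically smallest relabeled edge list.
        canon = min(
            tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
            for p in perms
        )
        classes.add(canon)
    return sorted(classes)

# The two cubic graphs on 6 vertices are K3,3 and the triangular prism.
print(len(regular_graphs(6, 3)))
```

The explicit minimum over all n! relabelings is the step that a fast canonicity test, as mentioned in the abstract, is designed to avoid.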
This article explores consensus structure elucidation on the basis of GC/EI-MS, structure generation, and calculated properties for unknown compounds. Candidate structures were generated using the molecular formula and substructure information obtained from GC/EI-MS spectra. Calculated properties were then used to score candidates according to a consensus approach, rather than filtering or exclusion. Two mass spectral match calculations (MOLGEN-MS and MetFrag), retention behavior (Lee retention index/boiling point correlation, NIST Kováts retention index), octanol-water partitioning behavior (log Kow), and finally steric energy calculations were used to select candidates. A simple consensus scoring function was developed and tested on two unknown spectra detected in a mutagenic subfraction of a water sample from the Elbe River using GC/EI-MS. The top candidates proposed using the consensus scoring technique were purchased and confirmed analytically using GC/EI-MS and LC/MS/MS. Although the compounds identified were not responsible for the sample mutagenicity, the structure-generation-based identification for GC/EI-MS using calculated properties and consensus scoring was demonstrated to be applicable to real-world unknowns and suggests that the development of a similar strategy for multidimensional high-resolution MS could improve the outcomes of environmental and metabolomics studies.
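The idea of scoring rather than excluding candidates can be illustrated with a small sketch. The abstract does not give the actual scoring function, so the equal-weight, min-max-normalized average below is an assumption, and the candidate names, property keys, and values are invented for illustration only.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; 1 is always the most favorable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def consensus_scores(candidates, criteria):
    """Average the normalized criteria with equal weights (assumed)."""
    names = list(candidates)
    totals = {name: 0.0 for name in names}
    for key, higher_is_better in criteria.items():
        column = [candidates[name][key] for name in names]
        for name, s in zip(names, min_max(column, higher_is_better)):
            totals[name] += s / len(criteria)
    return totals

# Hypothetical candidates with made-up property values.
candidates = {
    "A": {"match_molgen": 0.80, "match_metfrag": 0.70, "ri_error": 20,
          "logkow_error": 0.3, "steric_energy": 30},
    "B": {"match_molgen": 0.60, "match_metfrag": 0.75, "ri_error": 80,
          "logkow_error": 1.0, "steric_energy": 45},
    "C": {"match_molgen": 0.40, "match_metfrag": 0.50, "ri_error": 150,
          "logkow_error": 0.2, "steric_energy": 60},
}
# True: higher is better (match values); False: lower is better.
criteria = {"match_molgen": True, "match_metfrag": True, "ri_error": False,
            "logkow_error": False, "steric_energy": False}

scores = consensus_scores(candidates, criteria)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

Under this scheme a candidate that is weak on one criterion (here B on log Kow) is ranked lower but stays in the list, whereas a hard filter would remove it.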
Three programs were assessed for their ability to predict mass spectral fragmentation patterns for all constitutional isomers of an experimental low-resolution electron impact mass spectrum (EI-MS), given the molecular formula, and use this information to identify the "correct structure". MOLGEN 3.5 was used to generate the structures, while all spectra were extracted from the NIST database. The commercial programs Mass Frontier and ACD MS Manager, as well as MOLGEN-MSF (developed by the University of Bayreuth), were used to generate mass spectral fragments. MOLGEN-MSF was used to generate "match values" to compare the different programs and their ability to identify the "correct structure". Although high match values could be achieved with certain settings, the ranking of the correct structure relative to other constitutional isomers was not significantly better than the results published previously and in some cases significantly worse. Furthermore, all programs showed bias toward specific structures, which changed significantly with minor changes to the program settings. Thus, advances in mass spectral fragment prediction have not necessarily improved computer-aided structure elucidation (CASE) from EI-MS, and the results indicate that caution must be used when confirming the identity of a compound based only on the match between its predicted fragments and the mass spectrum.
The identification of unknown compounds based on GC/EI-MS spectra and structure generation techniques has been improved by combining a number of strategies into a programmed sequence. The program MOLGEN-MS is used to determine the molecular formula and incorporate substructural information to generate all structures matching the mass spectral information. Mass spectral fragments are then predicted for each structure and compared with the experimental spectrum using a match value. Additional data are then calculated automatically for each candidate to allow exclusion of candidates that do not match other analytical information. The effectiveness of these "exclusion criteria", as well as the programming sequence, was tested using a case study of 29 isomers of formula C12H10O2. The default classifier precision resulted in the generation of too many structures in some cases, which was improved by up to several orders of magnitude by including additional classifiers or restrictions. Combining this with the exclusion of candidates based on a Lee retention index/boiling point correlation, octanol-water partitioning coefficients, steric energies, and finally spectral match values limited the number of candidate structures further, from over 1 billion without any restrictions down to fewer than 6 structures in 10 cases and below 35 in all but 3 cases. This method can be used in the absence of matching database spectra and brings unknown identification based on MS interpretation and structure generation techniques a step closer to practical reality.
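The sequential use of exclusion criteria can be sketched as a chain of filters that records how many candidates survive each step. The property names, thresholds, and candidate values below are hypothetical; the abstract does not state the cut-offs actually used.

```python
def apply_exclusions(candidates, limits):
    """Apply each (property, max_allowed) limit in order and record
    how many candidates remain after every step."""
    remaining = list(candidates)
    counts = [("start", len(remaining))]
    for prop, max_allowed in limits:
        remaining = [c for c in remaining if c[prop] <= max_allowed]
        counts.append((prop, len(remaining)))
    return remaining, counts

# Invented candidates; every property is a deviation (lower is better).
candidates = [
    {"id": "c1", "ri_error": 20, "logkow_error": 0.3,
     "steric_energy": 30, "spectral_mismatch": 0.2},
    {"id": "c2", "ri_error": 150, "logkow_error": 0.1,
     "steric_energy": 25, "spectral_mismatch": 0.1},
    {"id": "c3", "ri_error": 50, "logkow_error": 1.5,
     "steric_energy": 35, "spectral_mismatch": 0.3},
    {"id": "c4", "ri_error": 60, "logkow_error": 0.5,
     "steric_energy": 70, "spectral_mismatch": 0.2},
]
# Assumed thresholds, applied in the order named in the abstract.
limits = [("ri_error", 100), ("logkow_error", 1.0),
          ("steric_energy", 50), ("spectral_mismatch", 0.5)]

survivors, counts = apply_exclusions(candidates, limits)
for step, n_left in counts:
    print(f"{step:18s} {n_left}")
```

Recording the count after each step makes it easy to see which criterion removes the most candidates for a given unknown.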