Preliminary compound identification and peak annotation in gas
chromatography–mass spectrometry are usually performed using mass
spectral databases. Several algorithms exist for searching
a spectrum against a large mass spectral library. In many cases,
a library search returns a wrong answer even if the correct
compound is contained in the library. In this work, we present a deep
learning driven approach to library search that reduces the
probability of such cases. Machine learning ranking (learning to rank)
is a class of machine learning and deep learning algorithms that
compare (rank) objects. This work introduces the use
of deep learning ranking for small-molecule identification using
low-resolution electron ionization mass spectrometry. Instead of simple
similarity measures for two spectra, such as the dot product or the
Euclidean distance between vectors that represent spectra, a deep
convolutional neural network is used. The deep learning ranking model
outperforms other approaches and reduces the fraction of wrong
answers (at rank-1) by 9–23%, depending on the data set.
Spectra from the Golm Metabolome Database, Human Metabolome Database,
and FiehnLib were used for testing the model.
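
For orientation, the simple similarity measures that the abstract contrasts with the neural network can be sketched as follows. This is a minimal illustration of a baseline library search, not the deep learning ranking model itself. It assumes that a low-resolution spectrum is given as a mapping from integer m/z to intensity, and that both measures are applied to unit-normalized vectors. The function names, the `max_mz` cutoff, and the toy spectra are hypothetical.

```python
import math

def to_vector(peaks, max_mz=500):
    """Turn {integer m/z: intensity} into a fixed-length intensity vector."""
    vector = [0.0] * (max_mz + 1)
    for mz, intensity in peaks.items():
        if 0 <= mz <= max_mz:
            vector[mz] = float(intensity)
    return vector

def normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def dot_product_similarity(a, b):
    """Dot product of unit-normalized vectors; higher means more similar."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def euclidean_distance(a, b):
    """Euclidean distance of unit-normalized vectors; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(normalize(a), normalize(b))))

def library_search(query_peaks, library):
    """Return (name, similarity) pairs sorted from best to worst match."""
    query = to_vector(query_peaks)
    scored = [(name, dot_product_similarity(query, to_vector(peaks)))
              for name, peaks in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy library and query; intensities are made up for illustration.
toy_library = {
    "compound_A": {43: 100, 58: 30, 71: 10},
    "compound_B": {43: 20, 57: 100, 85: 40},
}
toy_query = {43: 95, 58: 35, 71: 8}
ranking = library_search(toy_query, toy_library)
```

In a learning-to-rank setting, the fixed similarity function would be replaced by a trained model that compares candidates; the sorting step stays the same.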
Gas chromatography is a widely used method in analytical chemistry and metabolomics. Using gas chromatography, vaporizable compounds can be separated for further identification. Retention indices are standardized values that characterize the retention of a compound in a chromatographic system and depend only on the chemical structure of the compound and on the stationary phase. Retention index prediction is an important task because databases contain experimental values for only a small fraction of all possible molecules, while this information is useful for untargeted analysis. In this work, we consider four machine learning models for retention index prediction: 1D and 2D convolutional neural networks, a deep residual multilayer perceptron, and gradient boosting. The inputs of these four models are, respectively, a string representation of the molecule, a 2D representation of the chemical structure, molecular descriptors and fingerprints, and molecular descriptors, along with information about the stationary phase. The first and third models show the best performance, while the other two perform slightly worse. The models predict retention index values for various standard and semi-standard non-polar stationary phases. Further improvement in performance was achieved using a linear model that takes the outputs of the four previous models as inputs (model stacking). The models were tested on diverse data sets: flavor compounds, essential oils, and metabolomics-related compounds. The achieved accuracy, depending on the test data set, is a median absolute error of 6–40 units and a median percentage error of 0.8–2.2%. The stacking model outperforms previously reported approaches for all test data sets. Parameters of a pre-trained model and some source code are provided.
INDEX TERMS Analytical chemistry, convolutional neural network, deep learning, gas chromatography, gradient boosting, residual neural network, retention index, untargeted chemical analysis.
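
The stacking step mentioned above, a linear model on top of the four base predictions, can be sketched as follows. This is only an illustration under stated assumptions: it uses ordinary least squares with an intercept, which the abstract does not specify, and all function names and numbers are hypothetical toy values rather than results from the paper.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_stacking(base_predictions, targets):
    """Least-squares fit of target ~ intercept + sum_k w_k * base_prediction_k."""
    n = len(targets)
    k = len(base_predictions[0])
    # Centre features and targets; the intercept is recovered afterwards.
    x_mean = [sum(row[j] for row in base_predictions) / n for j in range(k)]
    y_mean = sum(targets) / n
    xc = [[row[j] - x_mean[j] for j in range(k)] for row in base_predictions]
    yc = [y - y_mean for y in targets]
    # Normal equations (X^T X) w = X^T y on the centred data.
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(k)]
    weights = solve_linear_system(xtx, xty)
    intercept = y_mean - sum(w * mu for w, mu in zip(weights, x_mean))
    return intercept, weights

def predict_stacking(intercept, weights, base_prediction):
    """Combine the base-model predictions for one compound."""
    return intercept + sum(w * p for w, p in zip(weights, base_prediction))

# Hypothetical toy data: each row holds four base-model predictions for one compound.
toy_base = [
    [1000.0, 1010.0, 995.0, 1020.0],
    [1200.0, 1185.0, 1210.0, 1190.0],
    [1500.0, 1520.0, 1490.0, 1475.0],
    [800.0, 790.0, 815.0, 805.0],
    [1350.0, 1340.0, 1365.0, 1380.0],
    [1100.0, 1125.0, 1090.0, 1105.0],
    [1700.0, 1690.0, 1720.0, 1695.0],
]
# Toy targets constructed as the average of the first and third predictions.
toy_targets = [0.5 * row[0] + 0.5 * row[2] for row in toy_base]

intercept, weights = fit_stacking(toy_base, toy_targets)
stacked = predict_stacking(intercept, weights, [1400.0, 1410.0, 1420.0, 1390.0])
```

Because the base predictions are strongly correlated, a practical implementation would more likely use a numerical library with a regularized or rank-aware solver; the hand-written solver here only keeps the sketch free of dependencies.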
Prediction of gas chromatographic retention indices based on compound structure is an important task for analytical chemistry. The predicted retention indices can be used as a reference in a mass spectrometry library search, even though they are less accurate than experimental reference values. In the last few years, deep learning has been applied to this task and has drastically improved the accuracy of retention index prediction for non-polar stationary phases. In this work, we demonstrate for the first time the use of deep learning for retention index prediction on polar (e.g., polyethylene glycol, DB-WAX) and mid-polar (e.g., DB-624, DB-210, DB-1701, OV-17) stationary phases. The achieved accuracy lies in the range of 16–50 in terms of the mean absolute error for several stationary phases and test data sets. We also demonstrate that our approach can be directly applied to the prediction of second-dimension retention times (GC × GC) if a large enough data set is available. The achieved accuracy is considerably better than previous results obtained using linear quantitative structure-retention relationships and ACD ChromGenius software. The source code and pre-trained models are available online.
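
One way predicted retention indices could serve as a reference in a library search is as a filter on spectral candidates. The sketch below illustrates that idea under assumptions that are not taken from the paper: candidates are dictionaries with a spectral score and a reference (predicted) retention index, and a fixed tolerance window is applied. The field names, the function name, and the tolerance value are hypothetical.

```python
def rerank_with_retention_index(candidates, observed_ri, tolerance):
    """Drop candidates whose reference RI deviates from the observed RI by more
    than `tolerance`, then sort the rest by spectral score (best first)."""
    kept = [c for c in candidates
            if abs(c["reference_ri"] - observed_ri) <= tolerance]
    return sorted(kept, key=lambda c: c["spectral_score"], reverse=True)

# Hypothetical candidates: two isomers with near-identical spectra, one unrelated hit.
toy_candidates = [
    {"name": "isomer_1", "spectral_score": 0.95, "reference_ri": 1320.0},
    {"name": "isomer_2", "spectral_score": 0.93, "reference_ri": 1185.0},
    {"name": "unrelated", "spectral_score": 0.60, "reference_ri": 1190.0},
]

# The tolerance is an illustrative value, not a recommendation.
shortlist = rerank_with_retention_index(toy_candidates, observed_ri=1200.0, tolerance=50.0)
```

Here the top spectral match is removed because its reference retention index is far from the observed one. The tolerance has to be wide enough to cover the prediction error on the stationary phase in question.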