Preliminary compound identification and peak annotation in gas
chromatography–mass spectrometry are usually performed using mass
spectral databases. Several algorithms exist for searching a spectrum
against a large mass spectral library. In many cases, a library search
returns a wrong answer even though the correct compound is present in
the library. In this work, we present a deep-learning-driven approach
to library search intended to reduce the probability of such cases.
Learning to rank is a class of machine learning and deep learning
algorithms that compare (rank) objects. This work introduces deep
learning ranking for small-molecule identification using low-resolution
electron ionization mass spectrometry. Instead of a simple similarity
measure between two spectra, such as the dot product or the Euclidean
distance between the vectors that represent them, a deep convolutional
neural network is used. The deep learning ranking model outperforms
other approaches and reduces the fraction of wrong answers (at rank-1)
by 9–23%, depending on the data set.
Spectra from the Golm Metabolome Database, Human Metabolome Database,
and FiehnLib were used for testing the model.
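
For orientation, the simple similarity measures mentioned above and the notion of a wrong answer at rank-1 can be sketched as follows. This is only a minimal illustration of a baseline library search, not the deep learning ranking model from this work. The spectra are made-up intensity vectors over four m/z bins, and all function and compound names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Normalized dot product between two intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Euclidean distance between two intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def library_search(query, library, similarity=cosine_similarity):
    """Return library compound names sorted from best to worst match."""
    return sorted(library, key=lambda name: similarity(query, library[name]),
                  reverse=True)


def rank1_error_rate(queries, library, similarity=cosine_similarity):
    """Fraction of queries whose top-ranked hit is not the true compound."""
    wrong = sum(
        1 for true_name, spectrum in queries
        if library_search(spectrum, library, similarity)[0] != true_name
    )
    return wrong / len(queries)


# Toy library and queries (assumed values, for illustration only).
library = {
    "compound_A": [100.0, 20.0, 0.0, 5.0],
    "compound_B": [10.0, 100.0, 30.0, 0.0],
    "compound_C": [0.0, 15.0, 100.0, 40.0],
}
queries = [
    ("compound_A", [95.0, 25.0, 2.0, 4.0]),
    ("compound_B", [12.0, 90.0, 35.0, 1.0]),
    # A distorted spectrum of C that resembles A: the correct compound is
    # in the library, yet the search ranks a wrong one first.
    ("compound_C", [80.0, 30.0, 10.0, 0.0]),
]

error_rate = rank1_error_rate(queries, library)
```

Because a smaller Euclidean distance means a better match, it can be plugged in as `similarity=lambda a, b: -euclidean_distance(a, b)`. In this toy example one of the three queries is misidentified at rank-1.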
Prediction of gas chromatographic retention indices from compound structure is an important task in analytical chemistry. Predicted retention indices can be used as reference values in a mass spectrometry library search, even though they are less accurate than experimental reference values. In the last few years, deep learning has been applied to this task and has drastically improved the accuracy of retention index prediction for non-polar stationary phases. In this work, we demonstrate for the first time the use of deep learning for retention index prediction on polar (e.g., polyethylene glycol, DB-WAX) and mid-polar (e.g., DB-624, DB-210, DB-1701, OV-17) stationary phases. The achieved mean absolute error lies in the range of 16–50 for several stationary phases and test data sets. We also demonstrate that our approach can be directly applied to the prediction of second-dimension retention times (GC × GC) if a large enough data set is available. The achieved accuracy is considerably better than previous results obtained using linear quantitative structure–retention relationships and the ACD ChromGenius software. The source code and pre-trained models are available online.
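
As a reminder of how the accuracy metric quoted above is computed, the mean absolute error between predicted and reference retention indices can be sketched as below. The retention index values are invented for illustration and are not taken from this work. The function name is hypothetical.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over all compounds."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up retention indices for three compounds (assumed values).
predicted_ri = [1210.0, 1495.0, 1830.0]
reference_ri = [1200.0, 1520.0, 1800.0]

# Absolute errors are 10, 25, and 30, so the mean is 65 / 3.
mae = mean_absolute_error(predicted_ri, reference_ri)
```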