Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically, these values account for about 10%–20% of all data and can arise from analytical, computational, or biological sources. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step of their metabolomics pipeline so routine that they do not even mention using this replacement approach. However, the choice of substitute may have a significant influence on the data analysis output(s) and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons have been demonstrated both visually and computationally (classification rate) to support our findings. The results show that the choice of replacement method used to impute missing values may have a considerable effect on the classification accuracy. If imputation is performed incorrectly, this may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF should be favoured for imputing missing values over the other tested methods. This approach displayed excellent results in terms of classification rate for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
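To make the compared substitutes concrete, the following is a minimal sketch of four of them (zero, mean, median and kNN) in plain Python. It is an illustration under stated assumptions, not the implementation used in the study: the function names `impute_constant` and `knn_impute`, the toy `peaks` table, the use of `None` to mark a missing value and the distance definition are all hypothetical. RF imputation is left out because it needs a tree-ensemble library.

```python
import math
from statistics import mean, median


def impute_constant(data, strategy):
    """Fill each missing value (None) with a per-column constant.

    strategy is 'zero', 'mean' or 'median'; the mean and median are
    computed from the observed values of the same column (variable).
    """
    fills = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        if strategy == "zero":
            fills.append(0.0)
        elif strategy == "mean":
            fills.append(mean(observed))
        elif strategy == "median":
            fills.append(median(observed))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in data]


def _distance(a, b):
    """Root-mean-square difference over the columns observed in both rows.

    This distance is an assumption made for the sketch; other choices
    are possible.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(data, k=2):
    """Fill each missing value with the mean of that column taken over
    the k most similar samples (rows) that have the column observed."""
    completed = []
    for i, row in enumerate(data):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            nearest = [val for dist, val in donors[:k] if dist < math.inf]
            if nearest:
                new_row[j] = mean(nearest)
            else:
                # No comparable sample: fall back to the column mean.
                new_row[j] = mean(r[j] for r in data if r[j] is not None)
        completed.append(new_row)
    return completed


# Hypothetical peak table: rows are samples, columns are metabolites.
peaks = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [5.0, 6.0, 9.0],
    [5.2, None, 9.4],
]

for name in ("zero", "mean", "median"):
    print(name, impute_constant(peaks, name))
print("kNN", knn_impute(peaks, k=1))
```

In this toy table the first two samples resemble each other, as do the last two, so kNN borrows each missing value from the similar sample, whereas the mean and median pool both groups. This is one way the class distribution of the samples can affect constant-value substitutes.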
There is a growing drive in the chemistry community to exploit rapidly developing robotic technologies along with artificial intelligence-based approaches. Applying these to chemistry requires a holistic approach to chemical synthesis design and execution. Here, we outline a universal approach to this problem, beginning with an abstract representation of the practice of chemical synthesis that then informs the programming and automation required for its practical realization. Using this foundation to construct closed-loop robotic chemical search engines, we can generate new discoveries that may be verified, optimized, and repeated entirely automatically. These robots can perform chemical reactions and analyses much faster than can be done manually. This leads to a road map whereby molecules can be discovered, optimized, and made on demand from a digital code.
Automation in Chemical Synthesis
Methodologies for the automation of chemical synthesis, optimization, and discovery have not generally been designed for the realities of laboratory-based research, tending instead to focus on engineering solutions to practical problems. We argue that the potential of rapidly developing technologies (e.g., machine learning and robotics) is more fully realized by operating seamlessly with the way that synthetic chemists currently work (Figure 1) [1]. This is because the organic chemist often works by thinking backwards as much as forwards when planning a synthetic procedure. To reproduce this fundamental mode of operation, a new universal approach to the automated exploration of chemical space is needed that combines an abstraction of chemical synthesis with robotic hardware and closed-loop programming [2,3]. However, this leads chemists to constantly test the reactions with different synthetic parameters and conditions.
The alternative, as shown in this opinion article, is the development of an approach to universal chemistry using a programming language with automation in combination with machine learning and artificial intelligence (AI). Chemists already benefit from algorithms in the field of chemometrics and, therefore, automation is one step forward that might help chemists to navigate and search chemical space more quickly, efficiently, and, importantly, without bias. Chemometrics is a field that employs a broad range of algorithms to solve chemistry-related problems and has been well established over the past 50 years [4]. Figure 2 presents a standard chemometrics workflow for processing data. The process begins with data whose format depends on the experiment type and/or the question posed. The next step is data preprocessing, which covers a variety of procedures depending on the type of data analyzed (e.g., peak detection, input of missing data, and/or normalization). This process is followed by statistical modeling, which is divided into supervised and unsupervised approaches. Probably one of the most well-known unsupervised approaches is principal component analysis (s...
The search for alien life is hard because we do not know what signatures are unique to life. We show why complex molecules found in high abundance are universal biosignatures and demonstrate the first intrinsic, experimentally tractable measure of molecular complexity, called the molecular assembly index (MA). To do this, we calculate the complexity of several million molecules and validate that their complexity can be experimentally determined by mass spectrometry. This approach allows us to identify molecular biosignatures from a set of diverse samples from around the world, outer space, and the laboratory, demonstrating that it is possible to build a life detection experiment based on MA that could be deployed to extraterrestrial locations and used as a complexity scale to quantify the constraints needed to direct prebiotically plausible processes in the laboratory. Such an approach is vital for finding life elsewhere in the universe or creating de novo life in the lab.
Recently, automated robotic systems have become very efficient thanks to improved coupling between sensor systems and algorithms, the latter having gained significance with the increase in computing power over the past few decades. However, intelligent automated chemistry platforms for discovery-oriented tasks need to be able to cope with the unknown, which is a profoundly hard problem. In this Outlook, we describe how recent advances in the design and application of algorithms, coupled with the increased amount of chemical data available and with automation and control systems, may allow more productive chemical research and the development of chemical robots able to target discovery. This is shown through examples of workflow and data processing with automation and control, and through the use of both well-established and cutting-edge algorithms, illustrated using recent studies in chemistry. Finally, several algorithms are presented in relation to chemical robots and chemical intelligence for knowledge discovery.