Introduction
Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be more suited to modelling possible nonlinear metabolite covariance, and thus provide better predictive models.
Objectives
We hypothesise that for binary classification using metabolomics data, non-linear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard PLS discriminant analysis.
Methods
We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks.
Results
There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty of model prediction such that the quality of metabolomics data was observed to be a bigger influence on generalised performance than model choice.
Conclusion
The size of the data set, and choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.
Electronic supplementary material
The online version of this article (10.1007/s11306-019-1612-4) contains supplementary material, which is available to authorized users.
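The out-of-bag bootstrap confidence interval mentioned in the Results can be sketched as follows. This is a minimal illustration, not the authors' code: the nearest-centroid model (`fit_centroids`, `score_centroids`), the synthetic data, the AUC metric, and the parameter names `n_boot` and `alpha` are all assumptions made for this example. Each resample trains on the in-bag samples and scores the samples left out, and the interval is taken from the percentiles of those out-of-bag scores.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroids(X, y):
    """Toy stand-in model (assumption): the per-class feature means."""
    def mean_vec(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return (mean_vec([x for x, t in zip(X, y) if t == 0]),
            mean_vec([x for x, t in zip(X, y) if t == 1]))


def score_centroids(model, X):
    """Higher score means closer to the class-1 centroid than to class 0."""
    c0, c1 = model
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [d2(x, c0) - d2(x, c1) for x in X]


def oob_bootstrap_ci(X, y, n_boot=200, alpha=0.05, seed=0):
    """Percentile interval of the AUC measured on out-of-bag samples."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    while len(stats) < n_boot:
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_set = set(in_bag)
        oob = [i for i in range(n) if i not in in_set]  # samples never drawn
        y_in = [y[i] for i in in_bag]
        y_oob = [y[i] for i in oob]
        # Skip resamples where either split lacks one of the two classes.
        if len(set(y_in)) < 2 or len(set(y_oob)) < 2:
            continue
        model = fit_centroids([X[i] for i in in_bag], y_in)
        stats.append(auc(y_oob, score_centroids(model, [X[i] for i in oob])))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic two-class data (assumption): 3 features, class means 0 and 2.
data_rng = random.Random(1)
y = [0] * 30 + [1] * 30
X = [[data_rng.gauss(2.0 * t, 1.0) for _ in range(3)] for t in y]

lo, hi = oob_bootstrap_ci(X, y)
print(f"95% OOB bootstrap CI for AUC: [{lo:.2f}, {hi:.2f}]")
```

A wide interval here signals that the data, rather than the model, limits how confidently performance can be stated, which is the point the Results make about data quality.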
Background
A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive and intuitive for both computational novices and experts alike.
Aim of Review
To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.
Key Scientific Concepts of Review
This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.
The application of large-scale metabolomic profiling provides new opportunities to realize the potential of omics-based precision medicine with regard to asthma. We leveraged over 14,000 individuals from four distinct epidemiological studies. We identified and independently replicated seventeen steroid metabolites that were significantly reduced in individuals with prevalent asthma. Importantly, steroid levels were reduced among all individuals with asthma regardless of medication use; however, the largest reduction was associated with inhaled corticosteroid (ICS) use, which was further confirmed in a four-year ICS clinical trial. Cortisol levels extracted from electronic medical records confirmed that cortisol is reduced among asthmatics taking ICS over the entire 24-hour period, compared with all other groups.
Clinical-grade adrenal suppression in asthmatics on ICS, resulting from substantial reductions in steroid metabolites, represents a larger public health problem than previously recognized. Regular cortisol testing may identify at-risk individuals, enabling personalized treatment modifications and improving overall patient care.
Introduction
Metabolomics data is commonly modelled multivariately using partial least squares discriminant analysis (PLS-DA). Its success is primarily due to ease of interpretation, through projection to latent structures, and transparent assessment of feature importance using regression coefficients and Variable Importance in Projection scores. In recent years several non-linear machine learning (ML) methods have grown in popularity, but uptake has been limited, essentially because of convoluted optimisation and interpretation. Artificial neural networks (ANNs) are a non-linear projection-based ML method that shares a structural equivalence with PLS, and as such should be amenable to equivalent optimisation and interpretation methods.
Objectives
We hypothesise that standardised optimisation, visualisation, evaluation and statistical inference techniques commonly used by metabolomics researchers for PLS-DA can be migrated to a non-linear, single hidden layer, ANN.
Methods
We compared a standardised workflow of optimisation, visualisation, evaluation and statistical inference techniques for PLS with the proposed ANN workflow. Both workflows were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks on GitHub.
Results
The migration of the PLS workflow to a non-linear, single hidden layer, ANN was successful. The significant metabolites determined using PLS model coefficients were similar to those determined using the ANN Connection Weight Approach.
Conclusion
We have shown that it is possible to migrate the standardised PLS-DA workflow to simple non-linear ANNs. This result opens the door for more widespread use and to the investigation of transparent interpretation of more complex ANN architectures.
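For readers unfamiliar with the Connection Weight Approach named in the Results, the sketch below shows the idea as it is commonly described for a single hidden layer network: each input's importance is the sum, over hidden nodes, of the input-to-hidden weight multiplied by the hidden-to-output weight. The function name `connection_weights` and the weight values are hypothetical, chosen only for illustration; the abstract does not give the authors' exact implementation.

```python
def connection_weights(w_input_hidden, w_hidden_output):
    """Signed importance of each input for a single hidden layer network.

    w_input_hidden[i][j] is the weight from input i to hidden node j.
    w_hidden_output[j] is the weight from hidden node j to the output.
    """
    return [
        sum(w_ij * v_j for w_ij, v_j in zip(row, w_hidden_output))
        for row in w_input_hidden
    ]


# Hypothetical weights for 3 inputs (metabolites) and 2 hidden nodes.
w_input_hidden = [
    [0.5, -1.0],
    [2.0, 0.0],
    [-0.5, 0.5],
]
w_hidden_output = [1.0, 2.0]

importance = connection_weights(w_input_hidden, w_hidden_output)
for index, value in enumerate(importance):
    print(f"input {index}: {value:+.2f}")
```

Because the products keep their sign, the result indicates direction as well as magnitude, which makes it comparable in spirit to a PLS regression coefficient.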
The purpose of this study was to analyze the association between plasma metabolite levels and dark adaptation (DA) in age-related macular degeneration (AMD). This was a cross-sectional study including patients with AMD (early, intermediate, and late) and control subjects older than 50 years without any vitreoretinal disease. Fasting blood samples were collected and used for metabolomic profiling with ultra-performance liquid chromatography–mass spectrometry (LC-MS). Patients were also tested with the AdaptDx (MacuLogix, Middletown, PA, USA) DA extended protocol (20 min). Two measures of dark adaptation were calculated and used: rod-intercept time (RIT) and area under the dark adaptation curve (AUDAC). Associations between dark adaptation and metabolite levels were tested using multilevel mixed-effects linear modelling, adjusting for age, gender, body mass index (BMI), smoking, race, AMD stage, and Age-Related Eye Disease Study (AREDS) formulation supplementation. We included a total of 71 subjects: 53 with AMD (13 early AMD, 31 intermediate AMD, and 9 late AMD) and 18 controls. Our results revealed that fatty acid-related lipids and amino acids related to glutamate and leucine, isoleucine and valine metabolism were associated with RIT (p < 0.01). Similar results were found when AUDAC was used as the outcome. Fatty acid-related lipids and amino acids are associated with DA, suggesting that oxidative stress and mitochondrial dysfunction likely play a role in AMD and in the visual impairment seen in this condition.