Adamo Young scite author profile

Machine Learning in Mass Spectrometric Analysis of DIA Data

Liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) is currently the top choice for high-throughput, quantitative measurements of the proteome. While traditional proteomics LC-MS/MS methods can suffer from low reproducibility and low quantitative accuracy due to their stochastic nature, recent improvements in acquisition protocols have resulted in methods that can overcome these challenges. Data-independent acquisition (DIA) is a novel mass spectrometric method that does so by using a deterministic acquisition strategy. These new approaches will allow researchers to apply MS to more complex samples; however, existing heuristic and expert-knowledge-based methods will struggle to keep pace with the increasing complexity of the resulting data. Deep learning (DL) based methods have been shown to be more adept at handling large amounts of complex data than traditional methods in many other fields, such as computer vision and natural language processing. Proteomics is also entering a phase where the size and complexity of the data will require us to look towards scalable and data-driven DL pipelines.


MassFormer: Tandem Mass Spectrum Prediction with Graph Transformers

Young, Wang, Röst. 2021. Preprint.


Mass spectrometry is a key tool in the study of small molecules, playing an important role in metabolomics, drug discovery, and environmental chemistry. Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule and help with its identification. Practitioners often rely on spectral library searches to match unknown spectra with known compounds. However, such search-based methods are limited by availability of reference experimental data. In this work we show that graph transformers can be used to accurately predict tandem mass spectra. Our model, MassFormer, outperforms competing deep learning approaches for spectrum prediction, and includes an interpretable attention mechanism to help explain predictions. We demonstrate that our model can be used to improve reference library coverage on a synthetic molecule identification task. Through quantitative analysis and visual inspection, we verify that our model recovers prior knowledge about the effect of collision energy on the generated spectrum. We evaluate our model on different types of mass spectra from two independent MS datasets and show that its performance generalizes. Code available at github.com/Roestlab/massformer.
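The spectral library search mentioned in this abstract can be illustrated with a minimal sketch. It assumes a common scoring choice (cosine similarity between binned spectra); the abstract does not say which score practitioners or the authors use, and the function names, the 1.0 m/z bin width, and the toy library entries are all hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin index -> intensity)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_library(query_peaks, library, bin_width=1.0):
    """Rank library entries by similarity to the query, best match first."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (name, cosine_similarity(query, bin_spectrum(peaks, bin_width)))
        for name, peaks in library.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy reference library: (m/z, intensity) pairs, invented for illustration.
toy_library = {
    "compound_A": [(55.1, 20.0), (91.0, 100.0), (119.2, 45.0)],
    "compound_B": [(43.0, 100.0), (71.3, 60.0), (99.4, 10.0)],
}
ranking = search_library([(55.1, 18.0), (91.0, 95.0), (119.2, 50.0)], toy_library)
```

A search like this can only return compounds that already have a reference spectrum, which is the coverage limitation the abstract describes; predicted spectra could in principle be added to such a library in the same `(m/z, intensity)` form.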


SELFIES and the future of molecular string representations

Krenn, Ai, Barthel et al. 2022. Preprint.


Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols have no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for


A graph neural network approach for molecule carcinogenicity prediction

Fradkin, Young, Atanackovic et al. 2022.


Motivation: Molecular carcinogenicity is a preventable cause of cancer, but systematically identifying carcinogenic compounds, which involves performing experiments on animal models, is expensive, time consuming and low throughput. As a result, carcinogenicity information is limited and building data-driven models with good prediction accuracy remains a major challenge.

Results: In this work, we propose CONCERTO, a deep learning model that uses a graph transformer in conjunction with a molecular fingerprint representation for carcinogenicity prediction from molecular structure. Special efforts have been made to overcome the data size constraint, such as multi-round pre-training on related but lower quality mutagenicity data, and transfer learning from a large self-supervised model. Extensive experiments demonstrate that our model performs well and can generalize to external validation sets. CONCERTO could be useful for guiding future carcinogenicity experiments and for providing insight into the molecular basis of carcinogenicity.

Availability and implementation: The code and data underlying this article are available on GitHub at https://github.com/bowang-lab/CONCERTO


Supervised topic modeling for predicting molecular substructure from mass spectrometry

et al. 2021


Small-molecule metabolites are principal actors in myriad phenomena across biochemistry and serve as an important source of biomarkers and drug candidates. Given a sample of unknown composition, identifying the metabolites present is difficult given the large number of small molecules both known and yet to be discovered. Even for biofluids such as human blood, building reliable ways of identifying biomarkers is challenging. A workhorse method for characterizing individual molecules in such untargeted metabolomics studies is tandem mass spectrometry (MS/MS). MS/MS spectra provide rich information about chemical composition. However, structural characterization from spectra corresponding to unknown molecules remains a bottleneck in metabolomics. Current methods often rely on matching to pre-existing databases in one form or another. Here we develop a preprocessing scheme and a supervised topic modeling approach that identifies modular groups of spectrum fragments and neutral losses corresponding to chemical substructures, using labeled latent Dirichlet allocation (LLDA) to map spectrum features to known chemical structures. These structures appear in new unknown spectra and can be predicted. We find that LLDA is an interpretable and reliable method for structure prediction from MS/MS spectra. Specifically, the LLDA approach has the following advantages: (a) molecular topics are interpretable; (b) a practitioner can select any set of chemical structure labels relevant to their problem; (c) LLDA performs well and can exceed the performance of other methods in predicting substructures in novel contexts.
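To make "spectrum fragments and neutral losses" concrete, here is a minimal sketch of how one spectrum could be turned into a bag of tokens for a topic model. The abstract does not describe the authors' preprocessing scheme, so everything here is an assumption for illustration: the function names, the rounding to two decimals, the `frag_`/`loss_` token format, and the treatment of a neutral loss as the precursor m/z minus the fragment m/z for singly charged ions.

```python
def neutral_losses(precursor_mz, fragment_mzs, min_loss=0.0, decimals=2):
    """Return sorted mass differences between the precursor and each fragment.

    Assumes singly charged ions, so the m/z difference equals the lost mass.
    Differences at or below `min_loss` are dropped.
    """
    losses = []
    for mz in fragment_mzs:
        loss = precursor_mz - mz
        if loss > min_loss:
            losses.append(round(loss, decimals))
    return sorted(losses)


def spectrum_to_document(precursor_mz, fragment_mzs, decimals=2):
    """Encode one spectrum as a list of 'word' tokens (hypothetical format)."""
    fragment_tokens = [f"frag_{mz:.{decimals}f}" for mz in fragment_mzs]
    loss_tokens = [
        f"loss_{loss:.{decimals}f}"
        for loss in neutral_losses(precursor_mz, fragment_mzs, decimals=decimals)
    ]
    return fragment_tokens + loss_tokens


# Invented example: precursor at m/z 180.0 with two fragment peaks.
document = spectrum_to_document(180.0, [162.0, 120.0])
```

Each such token list plays the role of a document, and the chemical substructure labels chosen by the practitioner would play the role of the document labels in a labeled topic model.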


A Graph Neural Network Approach to Molecule Carcinogenicity Prediction

Fradkin, Young, Atanackovic et al. 2021. Preprint.


Molecular carcinogenicity is a preventable cause of cancer; however, most experimental testing of molecular compounds is expensive and time consuming, making high-throughput experimental approaches infeasible. In recent years, there has been substantial progress in machine learning techniques for molecular property prediction. In this work, we propose a model for carcinogenicity prediction, CONCERTO, which uses a graph transformer in conjunction with a molecular fingerprint representation, trained on multi-round mutagenicity and carcinogenicity objectives. To train and validate CONCERTO, we augment the training dataset with more informative labels and utilize a larger external validation dataset. Extensive experiments demonstrate that our model yields results superior to alternative approaches for molecular carcinogenicity prediction.


Primary structure of ovine tumor necrosis factor alpha cDNA

1990


