Sabrina Jaeger scite author profile

Inspired by natural language processing techniques, we here introduce Mol2vec, which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Like the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing the vectors of the individual substructures and, for instance, be fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pretrained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment-independent and thus can also be easily used for proteins with low sequence similarities.

Pocketome of Human Kinases: Prioritizing the ATP Binding Sites of (Yet) Untapped Protein Kinases for Drug Discovery

Volkamer

Eid

Turk

et al. 2015

J. Chem. Inf. Model.

Protein kinases are involved in a variety of diseases including cancer, inflammation, and autoimmune disorders. Although the development of new kinase inhibitors is a major focus in pharmaceutical research, a large number of kinases have so far remained unexplored in drug discovery projects. The selection and assessment of targets is an essential but challenging area. Today, a few thousand experimentally determined kinase structures are available, covering about half of the human kinome. This large structural resource allows target selection to be guided by structure-based druggability prediction approaches such as DoGSiteScorer. Here, a thorough analysis of the ATP pockets of the entire human kinome in the DFG-in state is presented in order to prioritize novel kinase structures for drug discovery projects. For this, all human kinase X-ray structures available in the PDB were collected, and homology models were generated for the missing part of the kinome. DoGSiteScorer was used to calculate geometrical and physicochemical properties of the ATP pockets and to predict the potential of each kinase to be druggable. The results indicate that about 75% of the kinome is in principle druggable. Top-ranking structures comprise kinases that are primary targets of known approved drugs but additionally point to so far less explored kinases. The presented analysis provides new insights into the druggability of ATP binding pockets of the entire kinome. We anticipate this comprehensive druggability assessment of protein kinases to be helpful for the community to prioritize so far untapped kinases for drug discovery efforts.

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

Jaeger

Turk

2017

Preprint

Inspired by natural language processing techniques, we here introduce Mol2vec, which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, be fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be easily used for proteins with low sequence similarities.
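The compound-encoding step described in the abstract (summing substructure vectors into one dense compound vector) can be illustrated with a minimal sketch. It is not the Mol2vec implementation: the substructure identifiers, the three-dimensional toy vectors, and the helper names `embed_compound` and `cosine_similarity` are all hypothetical, and skipping unseen substructures is an assumption made only for this example.

```python
import math

# Toy embeddings keyed by hypothetical substructure identifiers.
# Real vectors would be learned from a compound corpus; these values are made up.
TOY_EMBEDDINGS = {
    "sub_a": [1.0, 0.0, 2.0],
    "sub_b": [0.5, 1.0, 0.0],
    "sub_c": [0.0, 2.0, 1.0],
}


def embed_compound(substructures, embeddings, dim=3):
    """Sum the vectors of a compound's substructures into one dense vector."""
    total = [0.0] * dim
    for sub in substructures:
        vec = embeddings.get(sub)
        if vec is None:
            # Assumption for this sketch: unseen substructures are skipped.
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two hypothetical compounds described by their substructure identifiers.
compound_1 = embed_compound(["sub_a", "sub_b"], TOY_EMBEDDINGS)
compound_2 = embed_compound(["sub_b", "sub_c"], TOY_EMBEDDINGS)
similarity = cosine_similarity(compound_1, compound_2)
```

The resulting fixed-length vectors, such as `compound_1`, are the kind of dense input that could be passed to a supervised model, and the cosine similarity reflects the idea that related items point in similar directions.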

Machine learning prediction of cyanobacterial toxin (microcystin) toxicodynamics in humans

Altaner

Jaeger

Fotler

et al. 2019

ALTEX

and higher temperature may provide a growth advantage for toxin-producing species (Kleinteich et al., 2012). MCs, which are produced by several cyanobacteria species, e.g., Microcystis spp., Dolichospermum spp. or Planktothrix spp., in water bodies worldwide (Preece et al., 2017), represent one of the toxin types most frequently associated with drinking water, food supplement and/or food contamination and have resulted in human health morbidity and mortality. Structurally, MCs are cyclic heptapeptides consisting of common L-amino acids, but also uncommon and unique amino acids. Their general structure is cyclo(). X and Z stand for variable L-amino acids, while β-D-MeAsp is erythro-β-D-methylaspartate, ADDA is (2S,3S,8S,9S,4E,6E)-3-amino-9-methoxy-2,6,8-trimethyl-10-phenyl-4,6-decadienoic acid and Mdha is N-methyldehydroalanine. The variable positions, along with various (de)methylation sites (Fig. 1, Tab. S1), currently provide for 248 known MC congeners (Spoof and Catherine, 2017), albeit new MC congeners are continuously being discovered. However, contrary to a recent

Challenges for Brain Data Analysis in VR Environments

Jaeger

Klein

Joos

et al. 2019

Visual Analytics for Cheetah Behaviour Analysis

Klein

Jaeger

Melzheimer

et al. 2019

Erratum to Machine learning prediction of cyanobacterial toxin (microcystin) toxicodynamics in humans

Altaner

Jaeger

Fotler

et al. 2020

ALTEX

Visual analytics of sensor movement data for cheetah behaviour analysis

Klein

Jaeger

Melzheimer

et al. 2021

J Vis

Current tracking technology such as GPS data loggers allows biologists to remotely collect large amounts of movement data for a large variety of species. Extending, and often replacing, interpretation based on observation, the analysis of the collected data supports research on animal behaviour, on impact factors such as climate change and human intervention on the globe, as well as on conservation programs. However, this analysis is difficult, due to the nature of the research questions and the complexity of the data sets. It requires both automated analysis, for example, for the detection of behavioural patterns, and human inspection, for example, for interpretation, inclusion of previous knowledge, and for conclusions on future actions and decision making. For this analysis and inspection, the movement data needs to be put into the context of environmental data, which helps to interpret the behaviour. Thus, a major challenge is to design and develop methods and intuitive interfaces that integrate the data for analysis by biologists. We present a concept and implementation for the visual analysis of cheetah movement data in a web-based fashion that allows usage both in the field and in office environments.
