A simple aqueous-processable alkaline ionomer (amenable to scale-up) has been developed for enhancing electrode/electrolyte interfaces in clean energy devices (e.g. alkaline polymer electrolyte membrane fuel cells). The water uptake of the alkaline ionomer is tuneable, allowing its use as a tool for fundamental studies into these interfaces.
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.
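The character-by-character processing described above can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`build_vocab`, `encode`, `decode`) and the special tokens `<pad>`, `<sos>` and `<eos>` are assumptions introduced here for illustration only.

```python
# Illustrative character-level tokenizer; all names and special tokens
# below are hypothetical and do not come from the paper.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(strings):
    """Map each special token and each distinct character to an integer id."""
    chars = sorted({ch for s in strings for ch in s})
    tokens = [PAD, SOS, EOS] + chars
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, vocab):
    """Convert a string to ids, one per character, wrapped in start/end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids, vocab):
    """Convert ids back to a string, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, SOS, EOS))


inchi = "InChI=1S/CH4/h1H4"  # standard InChI for methane
name = "methane"             # its IUPAC name

src_vocab = build_vocab([inchi])  # encoder-side (InChI) vocabulary
tgt_vocab = build_vocab([name])   # decoder-side (IUPAC name) vocabulary

src_ids = encode(inchi, src_vocab)
tgt_ids = encode(name, tgt_vocab)
```

In a real setting the two vocabularies would be built from the whole training set rather than from a single pair. Because every character is its own token, the vocabularies stay small, at the cost of longer sequences than word or sub-word tokenization would produce.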
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a
chemical from its standard International Chemical Identifier (InChI). The model uses two stacks
of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in
state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes
input and output into words or sub-words, our model processes the InChI and predicts the
IUPAC name character by character. The model was trained on a dataset of 10 million
InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online
PubChem service. Training took five days on a Tesla K80 GPU, and the model achieved test-set
accuracies of 95% (character-level) and 91% (whole name). The model performed particularly
well on organics, with the exception of macrocycles. The predictions were less accurate for
inorganic compounds, with a character-level accuracy of 71%. This can be explained by inherent
limitations of InChI for representing inorganics, as well as the low coverage of inorganics (1.4%) in
the training data.
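The difference between the two reported accuracies can be made concrete with a small sketch. The abstract does not say how character-level accuracy was computed, so the definition used here (position-by-position matches divided by the length of the longer string) is an assumption, as are the function names and the toy data.

```python
def whole_name_accuracy(predicted, reference):
    """Fraction of names predicted exactly, character for character."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def character_accuracy(predicted, reference):
    """Fraction of character positions that agree (assumed definition).

    Characters are compared position by position; the denominator uses the
    longer of the two strings so that missing or extra characters count as errors.
    """
    correct = 0
    total = 0
    for p, r in zip(predicted, reference):
        correct += sum(a == b for a, b in zip(p, r))
        total += max(len(p), len(r))
    return correct / total


# Toy data, not taken from the paper: one name is wrong by a single character.
reference = ["methane", "ethanol", "propan-2-one"]
predicted = ["methane", "ethanal", "propan-2-one"]

name_acc = whole_name_accuracy(predicted, reference)
char_acc = character_accuracy(predicted, reference)
```

In this toy case a single wrong character makes one of three names incorrect while leaving 25 of 26 characters correct, which shows why whole-name accuracy is the stricter of the two measures.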