2022
DOI: 10.1186/s13321-022-00640-5

TUCAN: A molecular identifier and descriptor applicable to the whole periodic table from hydrogen to oganesson

Abstract: TUCAN is a canonical serialization format that is independent of domain-specific concepts of structure and bonding. The atomic number is the only chemical feature that is used to derive the TUCAN format. Other than that, the format is solely based on the molecular topology. Validation is reported on a manually curated test set of molecules as well as a library of non-chemical graphs. The serialization procedure generates a canonical “tuple-style” output which is bidirectional, allowing the TUCAN string to serv…
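To make the idea in the abstract concrete, here is a minimal Python sketch of a canonical, bidirectional serialization that uses only atomic numbers and connectivity. It is not the TUCAN algorithm and does not reproduce the TUCAN string syntax: the function names `canonical_string` and `parse`, the `atoms/edges` text layout, and the brute-force search over all atom orderings are assumptions chosen for illustration, and the search is only practical for very small graphs.

```python
from itertools import permutations


def canonical_string(atomic_numbers, bonds):
    """Toy canonical serialization (illustrative only, NOT the TUCAN format).

    atomic_numbers: list of ints, one per atom.
    bonds: list of (i, j) index pairs describing connectivity only.
    """
    n = len(atomic_numbers)
    best = None
    # Try every relabeling and keep the lexicographically smallest result.
    for perm in permutations(range(n)):
        labels = [0] * n
        for old, new in enumerate(perm):
            labels[new] = atomic_numbers[old]
        edges = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        key = (tuple(labels), tuple(edges))
        if best is None or key < best:
            best = key
    labels, edges = best
    atoms_part = ",".join(str(z) for z in labels)
    edges_part = "".join(f"({a}-{b})" for a, b in edges)
    return f"{atoms_part}/{edges_part}"


def parse(text):
    """Inverse of canonical_string: recover atomic numbers and bonds."""
    atoms_part, edges_part = text.split("/")
    atomic_numbers = [int(z) for z in atoms_part.split(",")]
    bonds = []
    for chunk in edges_part.strip("()").split(")("):
        if chunk:  # an atom without bonds yields an empty edge section
            a, b = chunk.split("-")
            bonds.append((int(a), int(b)))
    return atomic_numbers, bonds


# Water, entered with two different atom orderings.
water_a = canonical_string([8, 1, 1], [(0, 1), (0, 2)])
water_b = canonical_string([1, 8, 1], [(0, 1), (1, 2)])
print(water_a)  # 1,1,8/(0-2)(1-2)
```

Both orderings of water give the same string, and feeding the parsed string back through `canonical_string` returns it unchanged, which is the sense of "canonical" and "bidirectional" used here. A practical implementation would replace the factorial-time search with a graph canonicalization procedure.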

Citations: cited by 14 publications (8 citation statements)
References: 38 publications
“…To this end, consistent and canonical identifiers for metal-containing compounds are urgently needed, as they are currently not accurately represented in chemical databases using standard chemical notation, such as SMILES (Simplified Molecular Input Line Entry System) strings. Recent efforts in this area have yielded promising candidates (for example, TUCAN [271] or SELFIES [272]). Once standardized structure nomenclatures for metal complexes are in place, efficient and open machine-processing tools can be developed for databases and machine-learning tasks [273].…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…One reason is that most conventional methods to generate descriptors or feature vectors for molecules rely on molecular representations, such as SMILES, which cannot be easily extrapolated onto metal complexes with multiple coordinating ligands. While some solutions have been proposed, they have not been widely applied yet [33,34]. In our case we took advantage of the fact that all 288 tested compounds and any compound of this class we wished to predict shared several similarities.…”
Section: Results
Citation type: mentioning
confidence: 99%
“…To make generative models more application-relevant, new methods are required that e.g., allow to include constraints in the design process, in the simplest case symmetries of generated molecules and materials, or in more complex scenarios additional (empirical or analytical) objectives such as synthesizability. A large step in that direction is new representations, not only for organic molecules but also for (3D) materials [263][264][265].…”
Section: Discussion
Citation type: mentioning
confidence: 99%