Background: To design concepts for a new general-purpose chemical format, we analyzed the strengths and weaknesses of current formats for common chemical data. While the new format is discussed in more detail in the next article, here we describe our software tools and the two-stage analysis procedure that supplied the information needed for its development. The chemical formats analyzed in both stages were CDX, CDXML, CML, CTfile and XDfile. In addition, the following formats were included in the first stage only: CIF, InChI, NCBI ASN.1, NCBI XML, PDB, PDBx/mmCIF, PDBML, SMILES, SLN and Mol2. Results: A two-stage analysis process devised for both XML (Extensible Markup Language) and non-XML formats enabled us to verify whether and how the potential advantages of XML are utilized in widely used general-purpose chemical formats. In the first stage, we accumulated information about the analyzed formats and selected those with the most general-purpose chemical functionality for the second stage. During the second stage, our set of software quality requirements was used to assess the benefits and issues of the selected formats. Additionally, the detailed analysis of the structure of the XML formats in the second stage helped us to identify concepts in those formats. Using these concepts, we derived a concise structure for a new chemical format, which is designed to provide precise built-in validation capabilities and aims to avoid the potential issues of the analyzed formats. Conclusions: We believe our analysis methodology is potentially highly reusable and could be easily adapted even for domains outside chemistry. This is because the methodology and software tools need only a few changes, although the analyzed formats and the software quality requirements for a format will differ according to the given domain.
Background: We wish to introduce a new chemical format called UCM (Universal Chemical Markup). The format is based on XML (Extensible Markup Language) and its first version focuses on recording chemical structures and their properties. Results: UCM currently supports structures containing isotopes, ions and various types of bonding, including delocalized bonds. Properties can be expressed by combining UCM with UnitsML (Units Markup Language). Using UnitsML, one defines quantities with scientific units and then refers to them in UCM when recording property values. Users can also add literature references with BibTeXML (BibTeX Markup Language) and annotate the recorded data using plain text or XHTML (Extensible Hypertext Markup Language) descriptions. In contrast to presently available general-purpose chemical formats, UCM offers built-in validation, which combines both grammar-based and pattern-based XML schema languages. Thus, all recorded data can be precisely validated by UCM schemas in standard XML validators. Conclusions: We developed the structure for UCM from scratch on the basis of an analysis described in our previous article. Starting from scratch allowed us to integrate BibTeXML, UnitsML and XHTML, as well as chemical line notations and identifiers, into UCM. It also helped us to avoid unnecessary redundant parts and to create an implementation that aims to minimize ambiguity and is designed to be easily extensible in the future.
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1336v1 | CC-BY 4.0 Open Access | rec
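The define-then-reference pattern described above (quantities are declared once with their units, and property values refer to them) can be illustrated with a minimal sketch. The element and attribute names used here (`record`, `quantity`, `property`, `quantityRef`) and the function `resolve_properties` are hypothetical and chosen only for illustration; they are not taken from the UCM or UnitsML schemas. The check for dangling references is an assumption about the kind of rule a pattern-based schema could express, not a description of the actual UCM schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are invented for
# illustration and are NOT taken from the UCM or UnitsML schemas.
DOC = """
<record>
  <quantities>
    <quantity id="q-mp" name="melting point" unit="K"/>
  </quantities>
  <structure name="water">
    <property quantityRef="q-mp" value="273.15"/>
    <property quantityRef="q-bp" value="373.15"/>
  </structure>
</record>
"""


def resolve_properties(xml_text):
    """Join each property value with the quantity it refers to.

    Returns (resolved, dangling): a list of (name, value, unit) tuples for
    references that match a quantity definition, and a list of reference
    ids that match no definition.
    """
    root = ET.fromstring(xml_text)
    # Index the quantity definitions by their id attribute.
    quantities = {q.get("id"): q for q in root.iter("quantity")}
    resolved, dangling = [], []
    for prop in root.iter("property"):
        ref = prop.get("quantityRef")
        quantity = quantities.get(ref)
        if quantity is None:
            # The property points at a quantity that was never defined.
            dangling.append(ref)
        else:
            resolved.append(
                (quantity.get("name"), float(prop.get("value")), quantity.get("unit"))
            )
    return resolved, dangling


resolved, dangling = resolve_properties(DOC)
print(resolved)
print(dangling)
```

In this sketch the second property refers to an undefined quantity, so it is reported as dangling instead of being silently accepted. A grammar-based schema alone typically constrains which elements and attributes may appear, whereas cross-reference rules of this kind are a common reason to add a pattern-based layer.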