The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena), maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), offers data producers an open and supported platform for the management, archiving, publication, and dissemination of data, and offers the scientific community as a whole a globally comprehensive data set through a host of data discovery and retrieval tools. Here, we describe recent updates to the ENA’s submission and retrieval services as well as focused efforts to improve connectivity, reusability, and interoperability of ENA data and metadata.
Background: Metadata attributes of sequences that accurately reference their biological sources, as specimens or other materials of origin, and link with natural history collections, are essential to facilitate the connections between different fields in life sciences and promote reusability of data. However, the metadata used to reference the biological source of sequences available within the molecular data repositories are not always well structured or comprehensive. Methods: Within the scope of the Horizon 2020 project Biodiversity Community Integrated Knowledge Library (BiCIKL), we have developed a tool, the European Nucleotide Archive (ENA) Source Attribute Helper Application Programming Interface (API), to help users accurately report biological source-related sequence and sample attributes. This tool currently focuses on the attributes that identify the specimens, cultures or other materials from which the sequence data were derived, and uses curated data to obtain the unique codes for the institutions and collections holding the vouchers. The API's main functions include the presentation of metadata associated with queried institutions or collections, validation of institution and collection codes in the attribute strings provided by the user, and the construction of an attribute string based on user-entered data. The API does not, however, support the search of voucher specimen codes, as these need to be obtained directly from the voucher institutions. We describe the API and discuss use cases for its different endpoints. The API is available at https://www.ebi.ac.uk/ena/sah/api/. Conclusions: We expect the API to promote and support the initial submission and any subsequent curation of biological source attributes, and thereby contribute to better links between sequence data and natural history collections, and hence to taxonomy and biodiversity research, increasing the discoverability, reusability and impact of data.
Metadata management for sequence data is essential for the accurate description of Earth’s biodiversity. Among metadata attributes, those that reference the biological sources of sequences and samples and allow linking to the specimen or sample of origin are fundamental for facilitating connections between molecular biology, taxonomy, systematic biology and biodiversity research, increasing the discoverability and usability of data by researchers worldwide. Sequence data are publicly archived by the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ) and the European Nucleotide Archive (ENA). Sequences stored at INSDC have a considerable range of associated metadata, including attributes related to their biological source, such as references to natural history collections or culture collections. However, these source attributes are not always submitted or may be incomplete, which limits the association of sequence records with the original source material and hampers further data connections (e.g., biological data associated with the voucher or species distribution data). Therefore, we have developed the ENA Source Attribute Helper API, a tool that aims to assist users in the submission of accurate attributes referring to the biological source of samples and sequence data. This tool was developed within the scope of BiCIKL (Biodiversity Community Integrated Knowledge Library) (Penev et al. 2022), a Horizon 2020 project that aims to build a wide, biodiversity-related community for connecting data along the different axes of biodiversity research. The first version of the tool was designed to support correct annotation of the attributes that identify the source material from which the sample or sequence was obtained, namely /specimen_voucher, /culture_collection, and /biomaterial (INSDC 2021).
These attributes follow a Darwin Core Triplet format (Wieczorek et al. 2012), composed of the institution code, the collection code and the specimen, culture, or material identifier, respectively. Since the biological source attributes may be submitted to the INSDC either when data are initially uploaded or during subsequent updates, using a variety of tools, we developed the API as an open-source tool that is publicly accessible and may be used as a free-standing service. The API is built using the Representational State Transfer (REST) architecture and is designed to use the data available in NCBI BioCollections (Sharma et al. 2018). NCBI BioCollections is a curated database of metadata for natural history collections, associated with records in INSDC, that includes the institution and collection codes. The API's main functions include querying the metadata for institutions and collections based on the user input (the API presents both exact matches and similar matches), validation of institution and collection codes in the attribute strings provided by the user, and the construction of the attribute string based on the user-provided information. The API does not include the search or validation of the voucher specimen codes. The API is designed so that it can be extended easily for future enhancements, and it is initially expected to promote and support the submission and any subsequent curation of better structured and more richly described source data. We expect this tool to contribute to better connected biodiversity data and hence provide a stronger foundation to strengthen the value of natural history collections, taxonomic expertise, and biodiversity knowledge.
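To make the triplet structure and the validate and construct functions more concrete, the following is a minimal offline sketch in Python. It is not the API's implementation and does not call the service. The function names (`construct`, `parse`, `validate`), the `SourceAttribute` type, and the `REGISTRY` codes are all hypothetical placeholders standing in for the curated NCBI BioCollections data. It assumes the parts are joined with colons in the order institution code, collection code, identifier, with the institution and collection codes optional.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class SourceAttribute(NamedTuple):
    """The three parts of a triplet-style source attribute value."""
    institution: Optional[str]
    collection: Optional[str]
    identifier: str


# Hypothetical placeholder registry: institution code -> known collection codes.
# These are NOT real BioCollections entries.
REGISTRY: Dict[str, Set[str]] = {"INSTA": {"BIRDS", "FISH"}, "INSTB": set()}


def construct(identifier: str,
              institution: Optional[str] = None,
              collection: Optional[str] = None) -> str:
    """Join the supplied parts with ':' in triplet order."""
    if not identifier:
        raise ValueError("an identifier is required")
    if collection and not institution:
        raise ValueError("a collection code needs an institution code")
    parts = [p for p in (institution, collection, identifier) if p]
    return ":".join(parts)


def parse(value: str) -> SourceAttribute:
    """Split an attribute string into institution, collection and identifier."""
    # At most two splits, so a third part may itself contain ':'.
    parts = value.split(":", 2)
    if len(parts) == 3:
        return SourceAttribute(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return SourceAttribute(parts[0], None, parts[1])
    return SourceAttribute(None, None, parts[0])


def validate(value: str,
             registry: Optional[Dict[str, Set[str]]] = None) -> List[str]:
    """Check the codes against the registry; an empty list means no problems."""
    registry = REGISTRY if registry is None else registry
    attr = parse(value)
    problems: List[str] = []
    if not attr.identifier:
        problems.append("missing identifier")
    if attr.institution is not None:
        if attr.institution not in registry:
            problems.append(f"unknown institution code: {attr.institution}")
        elif (attr.collection is not None
              and attr.collection not in registry[attr.institution]):
            problems.append(f"unknown collection code: {attr.collection}")
    # The identifier itself is not checked, mirroring the stated scope:
    # voucher specimen codes are neither searched nor validated.
    return problems
```

With the placeholder registry above, `construct("12345", "INSTA", "BIRDS")` returns `"INSTA:BIRDS:12345"`, and `validate("INSTA:MAMM:1")` reports the unknown collection code while leaving the identifier unchecked.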