The increased number of data repositories has greatly increased the availability of open data. To enable broad discovery and access to research dataset, some data repositories have begun leveraging data discovery services from commercial search engines by embedding structured metadata markup in dataset web landing pages using vocabularies from Schema.org and extensions. This paper aims to examine metadata interoperability for supporting global data discovery. Specifically, the paper reports a survey on which metadata schema has been adopted by participating data repositories, and presents an analysis of crosswalks from fourteen research data schemas to Schema.org. The analysis indicates most descriptive metadata are interoperable among the schemas, the most inconsistent mapping is the rights metadata, and a large gap exists in the structural metadata and controlled vocabularies to specify various property values. The analysis and collated crosswalks can serve as a reference for data repositories when they develop crosswalks from their own schemasto Schema.org, and provide the research data community a benchmark of structured metadata implementation.
No abstract
Background: The coronavirus disease 2019 (COVID-19) global pandemic required a rapid and effective response. This included ethical and legally appropriate sharing of data. The European Commission (EC) called upon the Research Data Alliance (RDA) to recruit experts worldwide to quickly develop recommendations and guidelines for COVID-related data sharing. Purpose: The purpose of the present work was to explore how the RDA succeeded in engaging the participation of its community of scientists in a rapid response to the EC request. Methods: A survey questionnaire was developed and distributed among RDA COVID-19 work group members. A mixed-methods approach was used for analysis of the survey data. Results: The three constructs of radical collaboration (inclusiveness, distributed digital practices, productive and sustainable collaboration) were found to be well supported in both the quantitative and qualitative analyses of the survey data. Other social factors, such as motivation and group identity were also found to be important to the success of this extreme collaborative effort. Conclusions: Recommendations and suggestions for future work were formulated for consideration by the RDA to strengthen effective expert collaboration and interdisciplinary efforts.
The concept of Data Management Plan (DMP) has emerged as a fundamental tool to help researchers through the systematical management of data. The Research Data Alliance DMP Common Standard (DCS) working group developed a core set of universal concepts characterising a DMP in the pursuit of producing a DMP as a machine-actionable information artefact, i.e.,machine-actionable Data Management Plan (maDMP). The technology-agnostic approach of the current maDMP specification: (i) does not explicitly link to related data models or ontologies, (ii) has no standardised way to describe controlled vocabularies, and (iii) is extensible, but has no clear mechanism to distinguish between the core specification and its extensions. Currently, the maDMP specification provides a JSON serialisation and schema to operationalise the approach. Such approach, however, does not address the concerns above. This paper reports on the community effort to create the DMP Common Standard Ontology (DCSO) as a serialisation of the DCS core concepts, with a particular focus on a detailed description of the components of the ontology. Our initial result shows that the proposed DCSO can become a suitable candidate for a reference serialisation of the DMP Common Standard.
RO-Crate (Soiland-Reyes et al. 2022) is a lightweight method to package research outputs along with their metadata, based on Linked Data principles (Bizer et al. 2009) and W3C standards. RO-Crate provides a flexible mechanism for researchers archiving and publishing rich data packages (or any other research outcome) by capturing their dependencies and context. However, additional measures should be taken to ensure that a crate is also following the FAIR principles (Wilkinson 2016), including consistent use of persistent identifiers, provenance, community standards, clear machine/human-readable licensing for metadata and data, and Web publication of RO-Crates. The FAIR Digital Object (FDO) approach (De Smedt et al. 2020) gives a set of recommendations that aims to improve findability, accessibility, interoperability and reproducibility for any digital object, allowing implementation through different protocols or standards. Here we present how we have followed the FDO recommendations and turned research outcomes into FDOs by publishing RO-Crates on the Web using HTTP, following best practices for Linked Data. We highlight challenges and advantages of the FDO approach, and reflect on what is required for an FDO profile to achieve FAIR RO-Crates. The implementation allows for a broad range of use cases, across scientific domains. A minimal RO-Crate may be represented as a persistent URI resolving to a summary website describing the outputs in a scientific investigation (e.g. https://w3id.org/dgarijo/ro/sepln2022 with links to the used datasets along with software). One of the advantages of RO-Crates is flexibility, particularly regarding the metadata accompanying the actual research outcome. RO-Crate extends schema.org, a popular vocabulary for describing resources on the Web (Guha et al. 2016). A generic RO-Crate is not required to be typed beyond Dataset*1. In practice, RO-Crates declare conformance to particular profiles, allowing processing based on the specific needs and assumptions of a community or usage scenario. This, effectively, makes RO-Crates typed and thus machine-actionable. RO-Crate profiles serve as metadata templates, making it easier for communities to agree and build upon their own metadata needs. RO-Crates have been combined with machine-actionable Data Management Plans (maDMPs) to automate and facilitate management of research data (Miksa et al. 2020). This mapping allows RO-Crates to be generated out of maDMPs and vice versa. The ELIXIR Software Management Plans (Alves et al. 2021) is planning to move their questionnaire to a machine-actionable format with RO-Crate. ELIXIR Biohackathon 2022 will explore integration of RO-Crate and the Data Stewardship Wizard (Pergl et al. 2019) with Galaxy, which can automate FDO creation that also follows data management plans. A tailored RO-Crate profile has been defined to represent Electronic Lab Notebooks (ELN) protocols bundled together with metadata and related datasets. Schröder et al. (2022) uses RO-Crates to encode provenance information at different levels, including researchers, manufacturers, biological and chemical resources, activities, measurements, and resulting research data. The use of RO-Crates makes it easier to programmatically question-answer information related to the protocols, for instance activities, resources and equipment used to create data. Another example is WorkflowHub (Goble et al. 2021) which defines the Workflow RO-Crate profile (Bacall et al. 2022), imposing additional constraints such as the presence of a main workflow and a license. It also specifies which entity types and properties must be used to provide such information, implicitly defining a set of operations (e.g., get the main workflow and its language) that are valid on all complying crates. The workflow system Galaxy (The Galaxy Community 2022) retrieves such Workflow Crates using GA4GH TRS API. The workflow profile has been further extended (with OOP-like inheritance) in Workflow Testing RO-Crate, adding formal workflow testing components: this adds operations such as getting remote test instances and test definitions, used by the LifeMonitor service to keep track of the health status of multiple published workflows. While RO-Crates use Web technologies, they are also self-contained, moving data along with their metadata. This is a powerful construct for interoperability across FAIR repositories, but this raises some challenges with regards to mutability and persistence of crates. To illustrate how such challenges can be handled, we detail how the WorkflowHub repository follows several FDO principles: Workflow entries must be frozen for editing and have complete kernel metadata (title, authors, license, description) [FDOF4] before they can be assigned a persistent identifier, e.g. https://doi.org/10.48546/workflowhub.workflow.255.1 [FDOF1] Computational workflows can be composed of multiple files used as a whole, e.g. CWL files in a GitHub repository. These are snapshotted as a single RO-Crate ZIP, indicating the main workflow. [FDOF11] PID resolution can content-negotiate to Datacite’s PID metadata [FDOF2] or use FAIR Signposting to find an RO-Crate containing the workflow [FDOF3] and richer JSON-LD metadata resources [FDOF5,FDOF8], see Fig. 1 Metadata uses schema.org [FDOF7] following the community-developed Bioschemas ComputationalWorkflow profile [FDOF10]. Workflows are discovered using the GA4GH TRS API [FDOF5,FDOF6,FDOF11] and created/modified using CRUD operations [FDOF6] The RO-Crate profile, effectively the FDO Type [FDOF7], is declared as https://w3id.org/workflowhub/workflow-ro-crate/1.0; the workflow language (e.g. https://w3id.org/workflowhub/workflow-ro-crate#galaxy) is defined in metadata of the main workflow. Workflow entries must be frozen for editing and have complete kernel metadata (title, authors, license, description) [FDOF4] before they can be assigned a persistent identifier, e.g. https://doi.org/10.48546/workflowhub.workflow.255.1 [FDOF1] Computational workflows can be composed of multiple files used as a whole, e.g. CWL files in a GitHub repository. These are snapshotted as a single RO-Crate ZIP, indicating the main workflow. [FDOF11] PID resolution can content-negotiate to Datacite’s PID metadata [FDOF2] or use FAIR Signposting to find an RO-Crate containing the workflow [FDOF3] and richer JSON-LD metadata resources [FDOF5,FDOF8], see Fig. 1 Metadata uses schema.org [FDOF7] following the community-developed Bioschemas ComputationalWorkflow profile [FDOF10]. Workflows are discovered using the GA4GH TRS API [FDOF5,FDOF6,FDOF11] and created/modified using CRUD operations [FDOF6] The RO-Crate profile, effectively the FDO Type [FDOF7], is declared as https://w3id.org/workflowhub/workflow-ro-crate/1.0; the workflow language (e.g. https://w3id.org/workflowhub/workflow-ro-crate#galaxy) is defined in metadata of the main workflow. Further work on RO-Crate profiles include to formalise links to the API operations and repositories [FDOF5,FDOF7], to include PIDs of profiles and types in the FAIR Signposting, and HTTP navigation to individual resources within the RO-Crate. RO-Crate has shown a broad adoption by communities across many scientific disciplines, providing a lightweight, and therefore easy to adopt, approach to generating FAIR Digital Objects. It is rapidly becoming an integral part of the interoperability fabric between the different components as demonstrated here for WorkflowHub, contributing to building the European Open Science Cloud.
Background The FAIR principles (Wilkinson et al. 2016) are fundamental for data discovery, sharing, consumption and reuse; however their broad interpretation and many ways to implement can lead to inconsistencies and incompatibility (Jacobsen et al. 2020). The European Open Science Cloud (EOSC) has been instrumental in maturing and encouraging FAIR practices across a wide range of research areas. Linked Data in the form of RDF (Resource Description Framework) is the common way to implement machine-readability in FAIR, however the principles do not prescribe RDF or any particular technology (Mons et al. 2017). FAIR Digital Object FAIR Digital Object (FDO) (Schultes and Wittenburg 2019) has been proposed to improve researcher’s access to digital objects through formalising their metadata, types, identifiers and exposing their computational operations, making them actionable FAIR objects rather than passive data sources. FDO is a set of principles (Bonino et al. 2019), implementable in multiple ways. Current realisations mostly use Digital Object Interface Protocol (DOIPv2) (DONA Foundation 2018), with the main implementation CORDRA. We can consider DOIPv2 as a simplified combination of object-oriented (CORBA, SOAP) and document-based (HTTP, FTP) approaches. More recently, the FDO Forum has prepared detailed recommendations, currently open for comments, including a DOIP endorsement and updated FDO requirements. These point out Linked Data as another possible technology stack, which is the focus of this work. Linked Data Linked Data standards (LD), based on the Web architecture, are commonplace in sciences like bioinformatics, chemistry and medical informatics – in particular to publish Open Data as machine-readable resources. LD has become ubiquitous on the general Web, the schema.org vocabulary is used by over 10 million sites for indexing by search engines – 43% of all websites use JSON-LD. Although LD practices align to FAIR (Hasnain and Rebholz-Schuhmann 2018), they do not fully encompass active aspects of FDOs. The HTTP protocol is used heavily for applications (e.g. mobile apps and cloud services), with REST APIs of customised JSON structures. Approaches that merge the LD and REST worlds include Linked Data Platform (LDP), Hydra and Web Payments. Meeting FDO principles using Linked Data standards Considering the potential of FDOs when combined with the mature technology stack of LD, here we briefly discuss how FDO principles in Bonino et al. (2019) can be achieved using existing standards. The general principles (G1–G9) apply well: Open standards with HTTP being stable for 30 years, JSON-LD is widely used, FAIR practitioners mainly use RDF, and a clear abstraction between the RDF model with stable bindings available in multiple serialisations. However, when considering the specific principles (FDOF1–FDOF12) we find that additional constraints and best practices need to be established – arbitrary LD resources cannot be assumed to follow FDO principles. This is equivalent to how existing use of DOIP is not FDO-compliant without additional constraints. Namely, persistent identifiers (PIDs) (McMurry et al. 2017) (FDOF1) are common in LD world (e.g. using http://purl.org/ or https://w3id.org/), however they don’t always have a declared type (FDOF2), or the PID may not even appear in the metadata. URL-based PIDs are resolvable (FDOF3), typically over HTTP using redirections and content-negotiation. One great advantage of RDF is that all attributes are defined semantic artefacts with PIDs (FDOF4), and attributes can be reused across vocabularies. While CRUD operations (FDOF6) are supported by native HTTP operations (GET/PUT/POST/DELETE) as in LDP , there is little consistency on how to define operation interfaces in LD (FDOF5). Existing REST approaches like OpenAPI and URI templates are mature and good candidates, and should be related to defined types to support machine-actionable composition (FDOF7). HTTP error code 410 Gone is used in tombstone pages for removed resources (FDOF12), although more frequent is 404 Not Found. Metadata is resolved to HTTP documents with their own URIs, but these frequently don’t have their own PID (FDOF8). RDF-Star and nanopublications (Kuhn et al. 2021) give ways to identify and trace provenance of individual assertions. Different metadata levels (FDOF9) are frequently developed for LD vocabularies across different communities (FDOF10), such as FHIR for health data, Bioschemas for bioinformatics and >1000 more specific bioontologies. Increased declaration and navigation of profiles is therefore essential for machine-actionability and consistent consumption across FAIR endpoints. Several standards exist for rich collections (FDOF11), e.g. OAI-ORE, DCAT, RO-Crate, LDP. These are used and extended heterogeneously across the Web, but consistent machine-actionable FDOs will need specific choices of core standards and vocabularies. Another challenge is when multiple PIDs refer to “almost the same” concept in different collections – significant effort have created manual and automated semantic mappings (Baker et al. 2013, de Mello et al. 2022). Currently the FDO Forum has suggested the use of LDP as a possible alternative for implementing FAIR Digital Objects (Bonino da Silva Santos 2021), which proposes a novel approach of content-negotiation with custom media types. Discussion The Linked Data stack provides a set of specifications, tools and guidelines in order to help the FDO principles become a reality. This mature approach can accelerate uptake of FDO by scholars and existing research infrastructures such as the European Open Science Cloud (EOSC). However, the amount of standards and existing metadata vocabularies poses a potential threat for adoption and interoperability. Yet, the challenges for agreeing on usage profiles apply equally to DOIP as LD approaches. We have worked with different scientific communities to define RO-Crate (Soiland-Reyes et al. 2022), a lightweight method to package research outputs along with their metadata. While RO-Crate’s use of schema.org shows just one possible metadata model, it's powerful enough to be able to express FDOs, and familiar to web developers. We have also used FAIR Signposting (Van de Sompel et al. 2022) with HTTP Link: headers as a way to support navigation to the individual core properties of an FDO (PID, type, metadata, licence, bytestream) that does not require heuristics of content-negotiation and is agnostic to particular metadata vocabularies and serialisations. We believe that by adopting Linked Data principles, we can accelerate FDO today – and even start building practical ways to assist scientists in efficiently answering topical questions based on knowledge graphs.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.