To improve the suitability of the Darwin Core standard for the research and management of alien species, the standard needs to express the native status of organisms, how well established they are and how they came to occupy a location. To facilitate this, we propose:
1. To adopt a controlled vocabulary for the existing Darwin Core term dwc:establishmentMeans.
2. To elevate the pathway term from the Invasive Species Pathways extension to become a new Darwin Core term dwc:pathway maintained as part of the Darwin Core standard.
3. To adopt a new Darwin Core term dwc:degreeOfEstablishment with an associated controlled vocabulary.
These changes to the standard will allow users to clearly state whether an occurrence of a species is native to a location or not, how it got there (pathway), and to what extent the species has become a permanent feature of the location. By improving Darwin Core for capturing and sharing these data, we aim to improve the quality of occurrence and checklist data in general and to increase the number of potential uses of these data.
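The three proposed terms can be pictured as fields on an occurrence record, each checked against its controlled vocabulary. The following is a minimal sketch under stated assumptions: the vocabulary values are small illustrative subsets chosen for this example rather than the adopted lists, and `check_record` is a hypothetical helper.

```python
# Illustrative subsets only: these values are assumptions for the sketch,
# not the vocabularies adopted by the standard.
VOCABULARIES = {
    "establishmentMeans": {"native", "introduced", "uncertain"},
    "degreeOfEstablishment": {"native", "casual", "established", "invasive"},
    "pathway": {"releaseInNature", "escapeFromConfinement", "unaided"},
}


def check_record(record):
    """Return (term, value) pairs whose value is outside the vocabulary.

    Terms that are absent or empty in the record are skipped, not flagged.
    """
    problems = []
    for term, allowed in VOCABULARIES.items():
        value = record.get(term)
        if value and value not in allowed:
            problems.append((term, value))
    return problems


occurrence = {
    "scientificName": "Harmonia axyridis",
    "establishmentMeans": "introduced",
    "degreeOfEstablishment": "invasive",
    "pathway": "biocontrol",  # not in the illustrative vocabulary above
}
print(check_record(occurrence))
```

Keeping each term in its own vocabulary lets native status, degree of establishment and pathway be validated independently of one another.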
For vast areas of the globe and large parts of the tree of life, data needed to inform trait diversity is incomplete. Such trait data, when fully assembled, however, form the link between the evolutionary history of organisms, their assembly into communities, and the nature and functioning of ecosystems. Recent efforts to close data gaps have focused on collating trait-by-species databases, which only provide species-level, aggregated value ranges for traits of interest and often lack the direct observations on which those ranges are based. Perhaps under-appreciated is that digitized biocollection records collectively contain a vast trove of trait data measured directly from individuals, but this content remains hidden and highly heterogeneous, impeding discoverability and use. We developed and deployed a suite of openly accessible software tools in order to collate a full set of trait descriptions and extract two key traits, body length and mass, from >18 million specimen records in VertNet, a global biodiversity data publisher and aggregator. We tested success rate of these tools against hand-checked validation data sets and characterized quality and quantity. A post-processing toolkit was developed to standardize and harmonize data sets, and to integrate this improved content into VertNet for broadest reuse. The result of this work was to add more than 1.5 million harmonized measurements on vertebrate body mass and length directly to specimen records. Rates of false positives and negatives for extracted data were extremely low. We also created new tools for filtering, querying, and assembling this research-ready vertebrate trait content for view and download. Our work has yielded a novel database and platform for harmonized trait content that will grow as tools introduced here become part of publication workflows. 
We close by noting how this effort extends to new communities already developing similar digitized content.
Database URL: http://portal.vertnet.org/search?advanced=1
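To give a sense of what extracting and harmonizing traits from free-text specimen records involves, here is a minimal sketch using regular expressions. It is not the authors' toolkit: the patterns, the `extract_traits` function and the sample remark are hypothetical, and real records are far more heterogeneous than this handles.

```python
import re

# Hypothetical patterns: a trait label, optional separator, a number, a unit.
LENGTH_RE = re.compile(
    r"(?:total\s+length|body\s+length|length)\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(mm|cm)\b",
    re.IGNORECASE,
)
MASS_RE = re.compile(
    r"(?:body\s+mass|mass|weight|wt)\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(kg|g)\b",
    re.IGNORECASE,
)

# Conversion factors used to harmonize values to millimetres and grams.
TO_MM = {"mm": 1.0, "cm": 10.0}
TO_G = {"g": 1.0, "kg": 1000.0}


def extract_traits(text):
    """Return harmonized length (mm) and mass (g); None where not found."""
    result = {"length_mm": None, "mass_g": None}
    match = LENGTH_RE.search(text)
    if match:
        result["length_mm"] = float(match.group(1)) * TO_MM[match.group(2).lower()]
    match = MASS_RE.search(text)
    if match:
        result["mass_g"] = float(match.group(1)) * TO_G[match.group(2).lower()]
    return result


print(extract_traits("Total length: 18.5 cm; weight 0.25 kg; skull only"))
```

Converting every value to a single unit at extraction time is what makes measurements from different collections comparable; a production tool would also need to be validated against hand-checked records, as described above.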
Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
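As one illustration of a computer-aided approach, a verbatim name string can be compared against a list of accepted names using approximate string matching. This sketch uses Python's standard `difflib`; the reference list, the synonym map, the 0.85 cutoff and the `resolve_name` helper are assumptions made for the example and are not part of the study.

```python
import difflib

# Hypothetical reference data for illustration.
ACCEPTED = ["Peromyscus maniculatus", "Sciurus carolinensis", "Turdus migratorius"]
SYNONYMS = {"Planesticus migratorius": "Turdus migratorius"}


def resolve_name(verbatim, cutoff=0.85):
    """Return (resolved_name, status); resolved_name is None if unresolved."""
    name = " ".join(verbatim.split())  # collapse stray whitespace
    if name in ACCEPTED:
        return name, "valid"
    if name in SYNONYMS:
        return SYNONYMS[name], "synonym"
    # Fall back to approximate matching to catch likely misspellings.
    close = difflib.get_close_matches(name, ACCEPTED, n=1, cutoff=cutoff)
    if close:
        return close[0], "misspelling"
    return None, "unresolved"


for raw in ["Peromyscus maniculatis", "Planesticus  migratorius", "Aves sp."]:
    print(raw, "->", resolve_name(raw))
```

A human-vetted reference set such as the one described here would serve as the ground truth for choosing the cutoff and for comparing tools of this kind.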
To understand biological and geological events and the history of collected samples, it is essential to determine and communicate location information accurately. The accuracy of a georeference depends upon the circumstances of the event. Historical collections depend on having clear verbatim locality descriptions, the correct interpretation of data written on labels, and on the availability of gazetteers and maps of suitable scale and time. Observation and tracking data localities depend on GPS (Global Positioning System) accuracy, and on presence or absence of nearby obstructions such as buildings, forest cover, cliffs, etc. Marine data depend on the accurate determination of the surface location and the techniques to position a dive event from that location and to determine its depth and extent. Many people are using smartphones or maps such as Google Earth and Google Maps to determine their georeferences – but are they suitable and accurate enough to determine locations and elevations? New editions of the Georeferencing Best Practices (Chapman and Wieczorek 2020), the Georeferencing Quick Reference Guide (Zermoglio et al. 2020), and the Georeferencing Calculator Manual (Bloom et al. 2020) were published earlier this year and address all the issues listed above and many more. These documents were based on earlier versions but have been updated and improved considerably – adding information for marine biomes, caves, lithographic stratifications, transects, and the use of elevation, as well as including many more illustrations and examples. The expansion of an extensive georeferencing glossary adds to consistency in the use of terms. The trio of documents now provides consistent guidance about how to georeference diverse locality types and detailed instructions on how to calculate uncertainty using many different coordinate reference systems and datums (horizontal and vertical) along with the importance of recording this information.
Finally, they provide guidance on how to set up a georeferencing project and how to relate the results to the Darwin Core Standard (Darwin Core Task Group 2009). For the last decade, Darwin Core (Wieczorek et al. 2012) has been one of the preferred standards for sharing biodiversity data, including associated location information. Darwin Core currently has 44 terms in its Location class, allowing sharing from administrative divisions, to elevations and depths, coordinates in different formats, and georeference metadata, among others. Although Darwin Core provides definitions for each of its terms, their correct use is sometimes poorly understood, resulting in information being captured incorrectly, or not captured, documented or shared at all. We will re-introduce these documents, discuss their content, importance, and differences from previously published versions. The newly revised documents provide guidance on capturing and documenting georeferences, clarifying the georeferencing process and showing how to capture information using Darwin Core appropriately. They will improve the location data associated with biological events and our understanding of these events.
The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness for use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. In a case study, two different implementations of tests and assertions based around the Darwin Core "Event Date" terms were also tested against GBIF data, to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
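A test-and-assertion pair of the kind described can be sketched as a function that inspects one Darwin Core term and reports a status. The example below is a simplified assumption, not one of the group's standardised tests: the function names, the status labels and the restriction to single calendar dates (no ranges or partial dates) are all choices made for illustration.

```python
from datetime import date


def _assertion(status, comment):
    """Package a test result as a simple assertion record."""
    return {"term": "dwc:eventDate", "status": status, "comment": comment}


def validate_event_date(record, today=None):
    """Check that eventDate is a single ISO 8601 calendar date not in the future."""
    today = today or date.today()
    value = (record.get("eventDate") or "").strip()
    if not value:
        return _assertion("PREREQUISITES_NOT_MET", "eventDate is empty")
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return _assertion("NOT_COMPLIANT", "not a valid ISO 8601 calendar date")
    if parsed > today:
        return _assertion("NOT_COMPLIANT", "date lies in the future")
    return _assertion("COMPLIANT", "eventDate is a valid past or present date")


for raw in ["2019-07-04", "2019-02-30", ""]:
    result = validate_event_date({"eventDate": raw}, today=date(2020, 1, 1))
    print(repr(raw), result["status"])
```

Because the function returns a structured assertion instead of modifying the record, the same result can be used both to filter data for a given purpose and to send feedback to the data provider.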