Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term—ideally perpetual—data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. 
For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach: linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated ≤2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
The ability to rapidly generate and share molecular, visual, and acoustic data, to compare them with existing information, and thereby to detect and name biological entities is fundamentally changing our understanding of evolutionary relationships among organisms and is also impacting taxonomy. Harnessing taxonomic data for rapid, automated species identification by machine learning tools or DNA metabarcoding techniques has great potential but will require their review, accessible storage, comprehensive comparison, and integration with prior knowledge and information. Currently, data production, management, and sharing in taxonomic studies are not keeping pace with these needs. Indeed, a survey of recent taxonomic publications provides evidence that few species descriptions in zoology and botany incorporate DNA sequence data. The use of modern high-throughput (-omics) data is so far the exception in alpha-taxonomy, although such data are easily stored in GenBank and similar databases. By contrast, the more routinely used image data are rarely made available in openly accessible repositories. Improved sharing and reuse of both types of data require institutions that maintain long-term data storage and capacity with workable, user-friendly but highly automated pipelines. Top priority should be given to standardization and pipeline development for the easy submission and storage of machine-readable data (e.g., images, audio files, videos, tables of measurements). The taxonomic community in Germany and the German Federation for Biological Data are researching options for a higher level of automation, improved linking among data submission and storage platforms, and for making existing taxonomic information more readily accessible.
The Paroedura bastardi clade, a subgroup of the Madagascan gecko genus Paroedura, currently comprises four nominal species: P. bastardi, supposedly widely distributed in southern and western Madagascar; P. ibityensis, a montane endemic; and P. tanjaka and P. neglecta, both restricted to the central west region of the island. Previous work has shown that Paroedura bastardi is a species complex with several strongly divergent mitochondrial lineages. Based on one mitochondrial and two nuclear markers, plus detailed morphological data, we undertake an integrative revision of this species complex. Using a representative sampling for seven nuclear and five mitochondrial genes, we furthermore propose a phylogenetic hypothesis of relationships among the species in this clade. Our analyses reveal at least three distinct and independent evolutionary lineages currently referred to P. bastardi. Conclusive evidence for the species status of these lineages comes from multiple cases of syntopic occurrence without genetic admixture or morphological intermediates, suggesting reproductive isolation. We discuss the relevance of this line of evidence and the conditions under which concordant differentiation in unlinked loci under sympatry provides a powerful approach to species delimitation, and taxonomically implement our findings by (1) designating a lectotype for Paroedura bastardi, now restricted to the extreme South-East of Madagascar, (2) resurrecting the binomen Paroedura guibeae Dixon & Kroll, 1974, which is applied to the species predominantly distributed in the South-West, and (3) describing a third species, Paroedura rennerae sp. nov., which has the northernmost distribution within the species complex.