1. Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo-referencing or dating, can diminish their usefulness. Manual cleaning is time-consuming, error-prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records.
2. Here, we present CoordinateCleaner, an R package to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo-referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio-temporal tests for fossils.
3. We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing [...]
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
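The kind of automated record flagging described in the abstract above can be illustrated with a minimal sketch. This is not CoordinateCleaner's API or its algorithms: the `Occurrence` type, the `flag_record` function, the three test names and the sample records are all hypothetical, chosen only to show how simple, reproducible checks can mark suspicious coordinates without deleting them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Occurrence:
    """One occurrence record (hypothetical structure, for illustration)."""
    species: str
    lat: float
    lon: float


def flag_record(rec: Occurrence) -> List[str]:
    """Return the names of the simple tests that a record fails."""
    flags = []
    # Coordinates outside the valid latitude/longitude range.
    if not (-90.0 <= rec.lat <= 90.0 and -180.0 <= rec.lon <= 180.0):
        flags.append("out_of_range")
    # Exactly (0, 0) often indicates a missing value stored as zero.
    if rec.lat == 0.0 and rec.lon == 0.0:
        flags.append("zero_coordinates")
    # Identical latitude and longitude can point to a data entry error.
    elif rec.lat == rec.lon:
        flags.append("equal_lat_lon")
    return flags


# Made-up records; an empty list means no test was failed.
records = [
    Occurrence("species_a", 4.61, -74.08),
    Occurrence("species_a", 0.0, 0.0),
    Occurrence("species_b", 95.0, -60.0),
    Occurrence("species_b", -3.1, -3.1),
]
flagged = {i: flag_record(r) for i, r in enumerate(records)}
```

Returning flags rather than dropping records keeps the decision with the user, which fits the abstract's wording that records are "potentially problematic" rather than certainly wrong.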
The unparalleled biodiversity found in the American tropics (the Neotropics) has attracted the attention of naturalists for centuries. Despite major advances in recent years in our understanding of the origin and diversification of many Neotropical taxa and biotic regions, many questions remain to be answered. Additional biological and geological data are still needed, as well as methodological advances that are capable of bridging these research fields. In this review, aimed primarily at advanced students and early-career scientists, we introduce the concept of “trans-disciplinary biogeography,” which refers to the integration of data from multiple areas of research in biology (e.g., community ecology, phylogeography, systematics, historical biogeography) and Earth and the physical sciences (e.g., geology, climatology, palaeontology), as a means to reconstruct the giant puzzle of Neotropical biodiversity and evolution in space and time. We caution against extrapolating results derived from the study of one or a few taxa to convey general scenarios of Neotropical evolution and landscape formation. We urge more coordination and integration of data and ideas among disciplines, transcending their traditional boundaries, as a basis for advancing tomorrow’s ground-breaking research. Our review highlights the great opportunities for studying the Neotropical biota to understand the evolution of life.
To understand the current biodiversity crisis, it is crucial to determine how humans have affected biodiversity in the past. However, the extent of human involvement in species extinctions from the Late Pleistocene onward remains contentious. Here, we apply Bayesian models to the fossil record to estimate how mammalian extinction rates have changed over the past 126,000 years, inferring specific times of rate increases. We specifically test the hypothesis of human-caused extinctions by using posterior predictive methods. We find that human population size is able to predict past extinctions with 96% accuracy. Predictors based on past climate, in contrast, perform no better than expected by chance, suggesting that climate had a negligible impact on global mammal extinctions. Based on current trends, we predict for the near future a rate escalation of unprecedented magnitude. Our results provide a comprehensive assessment of the human impact on past and predicted future extinctions of mammals.
The estimation of diversification rates is one of the most vividly debated topics in modern systematics, with considerable controversy surrounding the power of phylogenetic and fossil-based approaches in estimating extinction. Van Valen’s seminal work from 1973 proposed the “Law of constant extinction,” which states that the probability of extinction of taxa is not dependent on their age. This assumption of age-independent extinction has prevailed for decades with its assessment based on survivorship curves, which, however, do not directly account for the incompleteness of the fossil record, and have rarely been applied at the species level. Here, we present a Bayesian framework to estimate extinction rates from the fossil record accounting for age-dependent extinction (ADE). Our approach, unlike previous implementations, explicitly models unobserved species and accounts for the effects of fossil preservation on the observed longevity of sampled lineages. We assess the performance and robustness of our method through extensive simulations and apply it to a fossil data set of terrestrial Carnivora spanning the past 40 myr. We find strong evidence of ADE, as we detect the extinction rate to be highest in young species and declining with increasing species age. For comparison, we apply a recently developed analogous ADE model to a dated phylogeny of extant Carnivora. Although the phylogeny-based analysis also infers ADE, it indicates that the extinction rate, instead, increases with increasing taxon age. The estimated mean species longevity also differs substantially, with the fossil-based analyses estimating 2.0 myr, in contrast to 9.8 myr derived from the phylogeny-based inference. Scrutinizing these discrepancies, we find that both fossil and phylogeny-based ADE models are prone to high error rates when speciation and extinction rates increase or decrease through time. 
However, analyses of simulated and empirical data show that fossil-based inferences are more robust. This study shows that an accurate estimation of ADE from incomplete fossil data is possible when the effects of preservation are jointly modeled, thus allowing for a reassessment of Van Valen’s model as a general rule in macroevolution.
Abstract.-Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for non-model organisms. Many phylogenetic studies prefer to reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences which result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences to analyses performed under the Multispecies Coalescent (MSC) model. Our empirical analyses of Ultraconserved Element (UCE) locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last three million years, support the recognition of two species, and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River.
Our simulations show that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences. We conclude that allele phasing may be the most appropriate processing scheme for phylogenetic analyses of UCE data in particular, and sequence capture data, more generally.
(Fig. 4). Hereafter, we use "contigs" and "contig sequences" to refer to the sequences that are output by de novo assemblers. One alternative approach to generating contig sequences uses the depth of [...] estimation of gene trees, species trees, and divergence times (Garrick et al. 2010; Potts et al. 2014; Lischer et al. 2014). The common practice of neglecting allelic information in phylogenetic studies possibly results from historical inertia and a lack of computational pipelines to prepare allele sequences for phylogenetic analysis using MPS data. In addition to the problems of determining allelic se...
bioRxiv preprint first posted online Jan. 29, 2018; doi: http://dx.doi.org/10.1101/255752. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing technologies such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.
Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for non-model organisms. Many phylogenetic studies reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences which result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences to analyses performed under the Multispecies Coalescent (MSC) model. Our empirical analyses of Ultraconserved Element (UCE) locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last three million years. The phylogenetic results support the recognition of two species, and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River. Our simulations provide evidence that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences.
High-throughput DNA sequencing techniques enable time- and cost-effective sequencing of large portions of the genome. Instead of sequencing and annotating whole genomes, many phylogenetic studies focus sequencing effort on large sets of pre-selected loci, which further reduces costs and bioinformatic challenges while increasing coverage. One common approach that enriches loci before sequencing is often referred to as target sequence capture. This technique has been shown to be applicable to phylogenetic studies of greatly varying evolutionary depth. Moreover, it has proven to produce powerful, large multi-locus DNA sequence datasets suitable for phylogenetic analyses. However, target capture requires careful considerations, which may greatly affect the success of experiments. Here we provide a simple flowchart for designing phylogenomic target capture experiments. We discuss necessary decisions from the identification of target loci to the final bioinformatic processing of sequence data. We outline challenges and solutions related to the taxonomic scope, sample quality, and available genomic resources of target capture projects. We hope this review will serve as a useful roadmap for designing and carrying out successful phylogenetic target capture studies.