any diseases have been linked to SVs, most often defined as genomic changes at least 50 bp in size, but SVs are challenging to detect accurately. Conditions linked to SVs include autism 1 , schizophrenia, cardiovascular disease 2 , Huntington's disease and several other disorders 3. Far fewer SVs exist in germline genomes relative to small variants, but SVs affect more base pairs, and each SV might be more likely to affect phenotype 4-6. Although next-generation sequencing technologies can detect many SVs, each technology and analysis method has different strengths and weaknesses. To enable the community to
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment-and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90 % of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3 %, and genotype concordance with manual curation was >98.7 %. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping. GIAB is working towards a new version of the benchmark set that will use new technologies and methods such as PacBio Circular Consensus Sequencing and ultralong Oxford Nanopore sequencing to expand to more challenging genome regions and include more challenging SVs such as inversions. We are also developing a robust integration process to make calls on GRCh37 and GRCh38 for all seven GIAB samples.
A detailed understanding of gut microbial ecology is essential to engineer effective microbial therapeutics and to model microbial community assembly and succession in health and disease. However, establishing generalizable insights into the functional determinants of microbial fitness in the human gut has been a formidable challenge. Here we employ fecal microbiota transplantation (FMT) as an in natura experimental model to identify determinants of microbial colonization and resilience. Our long-term sampling strategy and high-resolution multi-omics analyses of FMT donors and recipients reveal adaptive ecological processes as the main driver of microbial colonization outcomes after FMT. We also show that high-fitness donor microbial populations are significantly enriched in metabolic pathways that are responsible for the biosynthesis of nucleotides, essential amino acids, and micronutrients, independent of taxonomy. To determine whether such metabolic competence can explain the microbial ecology of human disease states, we analyzed genomes reconstructed from healthy humans and humans with inflammatory bowel disease (IBD). Our data reveal that such traits are also significantly enriched in microbial genomes recovered from IBD patients, linking presence of superior metabolic competence in bacteria to their expansion in IBD. Overall, these findings suggest that the transfer of gut microbes from a healthy donor to a disrupted recipient environment initiates an environmental filter that selects for populations that can self-sustain. Such ecological processes that select for self-sustenance under stress offer a model to explain why common yet typically rare members of healthy gut environments can become dominant in inflammatory conditions without any need for them to be causally associated with, or contribute to, such disease states.
Structural variations (SV) are broadly defined as genomic alterations that affect >50bp of DNA, which are shown to have significant effect on evolution and disease. The advent of high throughput sequencing (HTS) technologies and the ability to perform whole genome sequencing (WGS), makes it feasible to study these variants in depth. However, discovery of all forms of SV using WGS has proven to be challenging as the short reads produced by the predominant HTS platforms (<200bp for current technologies) and the fact that most genomes include large amounts of repeats make it very difficult to unambiguously map and accurately characterize such variants. Furthermore, existing tools for SV discovery are primarily developed for only a few of the SV types, which may have conflicting sequence signatures (i.e. read pairs, read depth, split reads) with other, untargeted SV classes. Here we are introduce a new framework, Tardis, which combines multiple read signatures into a single package to characterize most SV types simultaneously, while preventing such conflicts. Tardis also has a modular structure that makes it easy to extend for the discovery of additional forms of SV.
Motivation Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most of the existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. In fact, complex SVs are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions, likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve calling accuracy of more simple variants. Results We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing datasets. We integrated these methods to our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, using a 30× coverage TARDIS achieved 96% sensitivity with only 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate of our approach for discovery of tandem, direct and inverted interspersed segmental duplications prediction on CHM1 (<5% for the top 50 predictions). Availability and implementation TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/. Supplementary information Supplementary data are available at Bioinformatics online.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.