pIRS is written in C++ and Perl, and is freely available at ftp://ftp.genomics.org.cn/pub/pIRS/.
We present a new approach to indel calling that explicitly exploits the fact that indel differences between a reference and a sequenced sample make the mapping of reads less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous, and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel and provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel, and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false-positive rate of ~10% for long indels (>5 bp), while still providing many more candidate indels than other approaches. [Supplemental material is available for this article.]

Calling indels from the mapping of short paired-end sequences to a reference genome is much more challenging than SNP calling because the indel by itself interferes with accurate mapping, and therefore only indels up to a few base pairs in size are allowed in the most popular mapping approaches (Li et al. 2008; Li and Durbin 2009; Li et al. 2009). The most powerful indel calling approach would be to perform de novo assembly of each genome and identify indels by alignment of genomes. However, this is computationally daunting and requires very high sequencing coverage. Therefore, local approaches offer more promise. Recent approaches exploit the paired-end information to perform local realignment of poorly mapped pairs, thus allowing for longer indels (Ye et al. 2009; Homer and Nelson 2010; McKenna et al. 2010; Albers et al. 2011). One such approach, Dindel, maps reads to a set of candidate haplotypes obtained from mapping or from external information. It uses a probabilistic framework that naturally integrates various sources of sequencing errors and was found to have high specificity for identification of indels of sizes up to half the read length (Albers et al. 2011). Deletions longer than that can be called using split-read approaches such as the one implemented in Pindel (Ye et al. 2009). Long insertions remain problematic because short reads will not span them and a certain amount of de novo assembly is required.

Our approach, implemented in SOAPindel, performs full local de novo assembly of regions where reads appear to map poorly, as indicated by an excess of paired-end reads where only one of the mates maps. The idea is to collect all unmapped reads at their expected genomic positions, then perform a local assembly of the regions with a high density of such reads, and finally align these assemblies to the reference. A related idea has recently been published by Carnevali et al. (2012), but their approach is designed for a different sequencing method, and software is not available for comparison. While conceptually simple, our approach is sensitive to v...
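The local-assembly step described above can be illustrated with a toy example. The following is only a minimal sketch, not SOAPindel's implementation: it assumes error-free reads, uses the reference flanks as start and end anchors, and replaces real alignment with naive prefix/suffix trimming. The function names (`build_de_bruijn`, `enumerate_paths`, `describe_indel`), the sequences, and the parameters `k` and `max_len` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, start, end, max_len):
    """Exhaustively list the sequences spelled by paths from start to end.

    max_len caps the spelled sequence length so that cycles in the graph
    cannot make the traversal run forever.
    """
    paths = []
    stack = [(start, start)]
    while stack:
        node, seq = stack.pop()
        if node == end and len(seq) > len(start):
            paths.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, seq + nxt[-1]))
    return paths


def describe_indel(ref, alt):
    """Trim the shared prefix, then the shared suffix (a naive stand-in for
    alignment). Returns (offset, ref_allele, alt_allele)."""
    shortest = min(len(ref), len(alt))
    p = 0
    while p < shortest and ref[p] == alt[p]:
        p += 1
    s = 0
    while s < shortest - p and ref[-1 - s] == alt[-1 - s]:
        s += 1
    return p, ref[p:len(ref) - s], alt[p:len(alt) - s]


# Toy heterozygous sample: one haplotype equals the reference,
# the other carries a 3-bp deletion.
ref = "ATGCCGTAGGCTTACGATCA"
alt = ref[:8] + ref[11:]
reads = [hap[i:i + 10] for hap in (ref, alt) for i in range(len(hap) - 9)]

k = 5
graph = build_de_bruijn(reads, k)
contigs = enumerate_paths(graph, ref[:k - 1], ref[-(k - 1):], max_len=2 * len(ref))
for contig in contigs:
    print(contig, describe_indel(ref, contig))
```

Because both haplotypes are present in the reads, the traversal yields two contigs between the anchors, which is how a heterozygous indel would show up in this simplified setting. Real data would additionally require handling of sequencing errors, repeats, and quality scoring.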
Summary
1. Metabarcoding of mixed arthropod samples for biodiversity assessment has mostly been carried out on the 454 GS FLX sequencer (Roche, Branford, Connecticut, USA), due to its ability to produce long reads (≥400 bp) that are believed to allow higher taxonomic resolution. The Illumina sequencing platforms, with their much higher throughputs, could potentially reduce sequencing costs and improve sequence quality, but the associated shorter read length (typically <150 bp) has deterred their usage in next-generation-sequencing (NGS)-based analyses of eukaryotic biodiversity, which often utilize standard barcode markers (e.g. COI, rbcL, matK, ITS) that are hundreds of nucleotides long.
2. We present a new Illumina-based pipeline to recover full-length COI barcodes from mixed arthropod samples. Our new assembly program, SOAPBarcode, a variant of the genome assembly program SOAPdenovo, uses paired-end reads of the standard COI barcode region as anchors to extract the correct pathways (sequences) out of otherwise chaotic 'de Bruijn graphs', which are caused by the presence of large numbers of COI homologs of high sequence similarity.
3. Two bulk insect samples of known species composition that were analysed in a recently published 454 metabarcoding study (Yu et al. 2012) are re-analysed with our pipeline. Compared to the Roche 454 results (c. 400-bp reads), our pipeline recovered full-length COI barcodes (658 bp) and 17-31% more species-level operational taxonomic units (OTUs) from bulk insect samples, with fewer untraceable (novel) OTUs. On the other hand, our PCR-based pipeline also revealed higher rates of contamination across samples, due to Illumina's increased sequencing depth. On balance, the assembled full-length barcodes and increased OTU recovery rates resulted in more resolved taxonomic assignments and more accurate beta diversity estimation.
4. The HiSeq 2000 and the SOAPBarcode pipeline together can achieve more accurate biodiversity assessment at a much reduced sequencing cost in metabarcoding analyses. However, greater precaution is needed to prevent cross-sample contamination during field preparation and laboratory operation because of the greater ability to detect non-target DNA amplicons present in low copy numbers.
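The idea of using paired-end reads as anchors to pick the correct pathways out of a tangled graph can be sketched as a filter over candidate sequences. This is a simplified illustration under stated assumptions, not SOAPBarcode's actual algorithm: candidate paths are assumed to be enumerated already, the second mate is assumed to be given in the same orientation as the path, and the function name `count_pair_support`, the sequences, and the span limits are hypothetical.

```python
def count_pair_support(path, pairs, min_span, max_span):
    """Count read pairs whose two mates both occur in `path` at a plausible
    distance. The second mate is assumed to be already oriented like the path."""
    support = 0
    for mate1, mate2 in pairs:
        i = path.find(mate1)    # first occurrence of the left mate
        j = path.rfind(mate2)   # last occurrence of the right mate
        if i == -1 or j == -1:
            continue
        span = j + len(mate2) - i  # outer distance covered by the pair
        if min_span <= span <= max_span:
            support += 1
    return support


# Two similar homologs share the same middle segment, so a graph built from
# their reads also contains chimeric start/end combinations.
mid = "GGCATCGTAGCT"
species_a = "ACCTGA" + mid + "TTGACA"
species_b = "ACGTGA" + mid + "TTCACA"
chimera_ab = "ACCTGA" + mid + "TTCACA"
chimera_ba = "ACGTGA" + mid + "TTGACA"
candidates = [species_a, species_b, chimera_ab, chimera_ba]

# One read pair per true species, anchoring its start to its end.
pairs = [("ACCTGA", "TTGACA"), ("ACGTGA", "TTCACA")]

kept = [c for c in candidates if count_pair_support(c, pairs, 20, 30) > 0]
print(kept)
```

Only the two sequences whose start and end are linked by an observed read pair survive; the chimeric combinations receive no paired-end support and are discarded.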
Background: Very low-coverage (0.1 to 1×) whole genome sequencing (WGS) has become a promising and affordable approach to discover genomic variants of human populations for genome-wide association study (GWAS). To support genetic screening using preimplantation genetic testing (PGT) in a large population, the sequencing coverage goes below 0.1× to an ultra-low level. However, the feasibility and effectiveness of ultra-low-coverage WGS (ulcWGS) for GWAS remain undetermined. Methods: We built a pipeline to carry out analysis of ulcWGS data for GWAS. To examine its effectiveness, we benchmarked the accuracy of genotype imputation at combinations of different coverages below 0.1× and sample sizes from 2000 to 16,000, using 17,844 embryo PGT samples with approximately 0.04× average coverage and the standard Chinese sample HG005 with known genotypes. We then applied the imputed genotypes of 1744 transferred embryos that have gestational ages and complete follow-up records to GWAS. Results: The accuracy of genotype imputation under ultra-low coverage can be improved by increasing the sample size and applying a set of filters. From 1744 born embryos, we identified 11 genomic risk loci associated with gestational ages and 166 genes mapped to these loci according to positional, expression quantitative trait locus, and chromatin interaction strategies. Among these mapped genes, CRHBP, ICAM1, and OXTR were more frequently reported as preterm birth related. By joint analysis of gene expression data from previous studies, we constructed interrelationships of mainly CRHBP, ICAM1, PLAGL1, DNMT1, CNTLN, DKK1, and EGR2 with preterm birth, infant disease, and breast cancer. Conclusions: This study not only demonstrates that ulcWGS can achieve relatively high genotype imputation accuracy and is usable for GWAS, but also provides insights into the associations between gestational age and genetic variations of the fetal embryos from the Chinese population.
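To give a sense of scale for the coverage figures quoted above, the following back-of-the-envelope sketch computes how sparse a single 0.04× sample is and how much depth accumulates across the cohort. It assumes read depth per site follows a Poisson distribution with the mean coverage as its rate, which is a common simplification and not something stated in the abstract; the function names are hypothetical.

```python
import math


def fraction_covered(mean_coverage):
    """Expected fraction of sites with at least one read, assuming Poisson depth."""
    return 1.0 - math.exp(-mean_coverage)


def pooled_depth(mean_coverage, n_samples):
    """Expected total depth per site summed over all samples."""
    return mean_coverage * n_samples


per_sample = fraction_covered(0.04)
cohort = pooled_depth(0.04, 17844)
print(f"sites covered in one 0.04x sample: {per_sample:.2%}")
print(f"expected pooled depth over 17,844 samples: {cohort:.0f}x")
```

Under this assumption a single sample covers roughly 4% of sites, while the cohort as a whole reaches a pooled depth of several hundred, which is consistent with the reported benefit of larger sample sizes for imputation.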
Background: Very low coverage (0.1 to 1x) whole genome sequencing (WGS) has become a promising and affordable approach to discover genomic variants of human populations for Genome-Wide Association Study (GWAS). To support genetic screening using Preimplantation Genetic Testing (PGT) in a large population, the sequencing coverage goes below 0.1x to an ultra-low level. However, the feasibility and effectiveness of such ultra-low coverage for GWAS remain undetermined. Methods: We devised a pipeline to process ultra-low coverage WGS data and benchmarked the accuracy of genotype imputation at combinations of different coverages below 0.1x and sample sizes from 2,000 to 16,000, using 17,844 embryo PGT samples with approximately 0.04x average coverage and the standard Chinese sample HG005 with known genotypes. We then applied the imputed genotypes of 1,744 transferred embryos that have gestational ages and complete follow-up records to GWAS. Results: The accuracy of genotype imputation under ultra-low coverage can be improved by increasing the sample size and applying a set of filters. From 1,744 born embryos, we identified 11 genomic risk loci associated with gestational ages and 166 genes mapped to these loci according to positional, expression quantitative trait locus and chromatin interaction strategies. Among these mapped genes, CRHBP, ICAM1 and OXTR were more frequently reported as preterm birth related. By joint analysis of gene expression data from previous studies, we constructed interrelationships of mainly CRHBP, ICAM1, PLAGL1, DNMT1, CNTLN, DKK1 and EGR2 with preterm birth, infant disease and breast cancer. Conclusions: This study not only demonstrates that ultra-low coverage WGS can achieve relatively high genotype imputation accuracy and is usable for GWAS, but also provides insights into genetic associations of the gestational age trait in fetal embryo samples from Chinese or Eastern Asian populations.