2021
DOI: 10.1101/2021.11.15.468652
Preprint

Extensive gene duplication in Arabidopsis revealed by pseudo-heterozygosity

Abstract: Background: It is becoming apparent that genomes harbor massive amounts of structural variation, and that this variation has largely gone undetected for technical reasons. In addition to being inherently interesting, structural variation can cause artifacts when short-read sequencing data are mapped to a reference genome. In particular, spurious SNPs (that do not show Mendelian segregation) may result from mapping of reads to duplicated regions. Recalling SNP using the raw reads of the 1001 Arabidopsis Genomes P…

Citations: cited by 10 publications (12 citation statements)
References: 71 publications
“…An additional 3842 SVs from the mate-pair data set were initially included in the nanopore data set but were removed during filtering because all individuals from one of the two parental species were heterozygotes. Most of the SVs with this pattern of excessive heterozygosity were deletions or insertions, consistent with expectations for pseudo-heterozygosity caused by errors in mapping transposable elements to the reference genome (Jaegle et al 2021). The proportion of SVs exhibiting excessive heterozygosity from the mate-pair sequencing data set (~32%) was much higher than for the nanopore sequence data set (~12.7%).…”
Section: Prevalence Of SVs and Sequencing Technologies | Type: supporting
confidence: 76%
“…Importantly, Sniffles filters false SV signals by considering both minimum read support as well as consistency of the breakpoint position and size. Because loci with excessive heterozygosity are likely caused by erroneous mapping of transposable elements to the reference genome (Jaegle et al 2021), we excluded SVs with excessive heterozygosity, defined for this purpose as all individuals in either species initially being called heterozygotes because they had reads supporting the reference and SV allele. Additionally, we excluded SVs with sequence data for less than 20% of the individuals, which left a total of 290,276 SVs for downstream analysis.…”
Section: Base Calling Structural Variant Calling and Variant Filtering | Type: mentioning
confidence: 99%
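
The two exclusion rules quoted above (drop SVs where every individual of either species is called heterozygous, and drop SVs with sequence data for fewer than 20% of individuals) can be sketched roughly as follows. This is only an illustrative sketch, not the citing authors' pipeline: the genotype encoding, the function names `all_called_heterozygous` and `keep_sv`, and the choice to ignore missing calls when testing for "all heterozygous" are assumptions made here.

```python
from typing import Dict, List, Optional

# Hypothetical genotype encoding (an assumption, not from the source):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous SV, None = no data.
Genotype = Optional[int]


def all_called_heterozygous(genotypes: List[Genotype]) -> bool:
    """True if at least one individual has a call and every call is heterozygous."""
    called = [g for g in genotypes if g is not None]
    return len(called) > 0 and all(g == 1 for g in called)


def keep_sv(genotypes_by_species: Dict[str, List[Genotype]],
            min_call_rate: float = 0.20) -> bool:
    """Apply both filters to one SV; return True if the SV is retained."""
    all_genotypes = [g for gts in genotypes_by_species.values() for g in gts]
    if not all_genotypes:
        return False

    # Filter 1: require sequence data for at least 20% of individuals.
    call_rate = sum(g is not None for g in all_genotypes) / len(all_genotypes)
    if call_rate < min_call_rate:
        return False

    # Filter 2: exclude SVs where either species consists only of heterozygotes,
    # the pattern attributed in the quote to pseudo-heterozygosity.
    return not any(all_called_heterozygous(gts)
                   for gts in genotypes_by_species.values())
```

For example, `keep_sv({"A": [1, 1, 1], "B": [0, 2, 0]})` rejects the SV because every individual of the hypothetical species `A` is heterozygous, whereas a site with mixed genotypes in both species and sufficient data is retained.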
“…To assemble contigs with the CLR dataset, we used Canu with a maximum input coverage of 200x, only using subreads larger than 10 kb, and polished the resulting assembly with Arrow (34), also using 200x of the initial long-reads. The resulting contigs had an NG50 of 14.82 Mb, which is on a par with the best published Arabidopsis thaliana CLR contigs (12-17).…”
Section: Results | Type: mentioning
confidence: 75%
“…To assemble contigs with the CLR dataset, we used Canu with a maximum input coverage of 200x, only using subreads larger than 10 kb, and polished the resulting assembly with Arrow (34), also using 200x of the initial long-reads. The resulting contigs had an NG50 of 14.82 Mb, which is on a par with the best published Arabidopsis thaliana CLR contigs (12-17). With the HiFi dataset, we compared the performance of five different assemblers: FALCON (23), HiCanu (30), Hifiasm (31), Peregrine (32), and Pacbio's Improved Phased Assembler (IPA; (33)).…”
Section: Performance Of the Assembler Of Choice | Type: mentioning
confidence: 99%