2018
DOI: 10.1186/s12864-018-4465-8

Towards pan-genome read alignment to improve variation calling

Abstract: Background: Typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole-genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the u…

Cited by 33 publications (38 citation statements)
References 29 publications
“…We considered to include running BWA-MEM on an index created by CHOP [9], which is a tool for indexing paths through known haplotypes in the graph, but we were unable to build the CHOP index for a whole human genome human graph in reasonable run-time. We also considered to include PanVC [11] in the comparison, but also this method does not seem to scale to a whole human genome, for which it takes weeks to run on [11]. When using vg to simulate reads, we used base pair substitution rate 0.01 and indel rate 0.02.…”
Section: Assessment Of Mapping Methods (mentioning)
confidence: 99%
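To make the quoted error rates concrete, here is a minimal sketch of a toy per-base error model. It is not vg's simulator: the function name `mutate_read`, the even split between insertions and deletions, and the uniform choice of replacement bases are all assumptions made for illustration. Only the two rates (0.01 and 0.02) come from the quoted statement.

```python
import random

def mutate_read(read, sub_rate=0.01, indel_rate=0.02, rng=None):
    """Apply per-base substitutions and indels to a read (toy model, hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in read:
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace the base with a different nucleotide.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate:
            # Indel: assumed 50/50 split between deletion and insertion.
            if rng.random() < 0.5:
                continue  # deletion: drop this base
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after this base
        else:
            out.append(base)
    return "".join(out)

# Example: perturb a 100 bp read with a fixed seed for repeatability.
original = "ACGT" * 25
simulated = mutate_read(original, rng=random.Random(42))
```

With these rates, roughly one base in a hundred is substituted and two in a hundred are hit by an indel, so most 100 bp reads carry a few edits relative to the sequence they were drawn from.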
“…In contrast, the r-index, of which we provide an implementation in this work, has no such length limitation. The most recent implementation of the hybrid index is CHIC (Valenzuela et al, 2018; based on CHICO; Valenzuela, 2016). Although CHIC has support for counting multiple occurrences of a pattern within a genomic database, it is an expensive operation, namely O(ℓ log log n), where ℓ is the number of occurrences in the databases and n is the length of the database.…”
Section: Related Work (mentioning)
confidence: 99%
“…This implies that the FM-index will become much slower and/or much larger as the number of genomes in the databases grows significantly. This bottleneck has forced researchers to consider variations of FM-indexes adapted for massive genomic data sets, such as the Valenzuela et al (2018) pan-genomic index or the Garrison et al (2018) variation graph. Some of these proposals use elements of the FM-index, but all deviate in substantial ways from the description above.…”
Section: Introduction (mentioning)
confidence: 99%
“…Aligning reads against whole genomic databases is called pan-genomic alignment [8] and should help genomic processing and storage catch up with sequencing. Unfortunately, although several authors have proposed other kinds of indexes (see, e.g., [9,10] and references therein), they lack the complete functionality of the FM-index and have not achieved the same popularity. In particular, they often limit the maximum length of a pattern, which will become problematic … The read ATAC does not match exactly against the reference GATTACAT but does against the second genome GATACAT we assembled, so if we add that genome to the index then we can avoid using approximate pattern matching to align that read.…”
Section: Introduction (mentioning)
confidence: 99%
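The ATAC example in the statement above can be checked directly. The following is a minimal sketch that uses plain substring search as a stand-in for an index lookup; it is not an FM-index, and the function name `find_exact` and the genome labels are hypothetical. Only the three sequences come from the quoted text.

```python
def find_exact(read, genomes):
    """Return (genome_name, offset) pairs where `read` occurs exactly."""
    hits = []
    for name, seq in genomes.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            # Continue searching one position later to catch overlapping hits.
            start = seq.find(read, start + 1)
    return hits

# Sequences taken from the quoted example; dictionary keys are illustrative labels.
reference_only = {"reference": "GATTACAT"}
pan_genome = {"reference": "GATTACAT", "second_genome": "GATACAT"}

hits_reference = find_exact("ATAC", reference_only)  # no exact occurrence
hits_pan = find_exact("ATAC", pan_genome)            # found in the second genome
```

Against the reference alone the read has no exact occurrence, so an aligner would have to fall back on approximate matching. Once the second genome is included, the read matches exactly at offset 1 of GATACAT.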