Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni

doi:10.1089/cmb.2019.0309

Cited by 41 publications

(20 citation statements)

References 35 publications

(47 reference statements)

“…This might be accomplished using unsupervised, sequence-driven clustering methods [34, 35], using the “founder sequence” framework [36, 37], or using some form of submodular optimization [38]. A more radical idea is to simply index all available individuals, forgoing the need to choose representatives; this is becoming more practical with the advent of new approaches for haplotype-aware path indexing [31] and efficient indexing for repetitive texts [39].…”

Section: Discussion (mentioning)

confidence: 99%

Reference flow: reducing reference bias using multiple population genomes

et al. 2021

Self Cite

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.

“…As previously mentioned, Gagie et al (2020) did not describe how to build the r-index -this was shown in a series of papers (Kuhnle et al 2020; Mun et al 2020; Boucher et al 2019). In particular, Boucher et al (2019) introduced Prefix Free Parsing (PFP), which takes as input a string S, window size w, and a prime p and produces a dictionary of substrings of S and a parse of S, that is a sequence of substrings in the alphabet (Kreft and Navarro 2013) -and showed how to build RLBWT from the dictionary and parse.…”

Section: How To Construct the R-index (mentioning)

confidence: 99%

Computational graph pangenomics: a tutorial on data structures and their applications

et al. 2022

Self Cite

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations—thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
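
The citation statement for this entry describes Prefix Free Parsing as taking a string S, a window size w, and a prime p, and producing a dictionary and a parse. The following is a minimal Python sketch of that idea, not the authors' implementation. Everything named here is an assumption made for illustration: the function names `window_hash`, `prefix_free_parse`, and `reconstruct`, the sentinel characters `\x01` and `\x02`, and the simple polynomial hash standing in for a Karp-Rabin fingerprint. A window of length w is treated as a phrase boundary when its hash is divisible by p, and consecutive phrases overlap by w characters.

```python
def window_hash(window: str) -> int:
    """Deterministic polynomial hash; a stand-in for a Karp-Rabin fingerprint."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(s: str, w: int, p: int):
    """Return (dictionary, parse) for s, using window size w and modulus p.

    Assumption: the sentinels '\x01' and '\x02' do not occur in s.
    """
    text = "\x01" + s + "\x02" * w
    phrases = []
    start = 0
    for i in range(1, len(text) - w + 1):
        is_final_window = (i + w == len(text))
        # A window is a boundary if its hash is 0 mod p; the final
        # window of end sentinels is always treated as a boundary.
        if is_final_window or window_hash(text[i:i + w]) % p == 0:
            phrases.append(text[start:i + w])
            start = i  # the next phrase begins with this window
    dictionary = sorted(set(phrases))
    rank = {phrase: k for k, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w: int) -> str:
    """Rebuild the sentinel-padded text by dropping each w-character overlap."""
    phrases = [dictionary[k] for k in parse]
    return phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
```

In this sketch the parse stores ranks into the sorted dictionary, so repeated phrases in a repetitive input are stored once in the dictionary and referenced many times in the parse. Building the RLBWT from the dictionary and parse, as the citation statement mentions, is not shown.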

“…Briefly explained, FM-index alignment tools are derived from the Burrows-Wheeler Transform [ 68 ]—a method to sufficiently compress large amount of data and finding approximate matches of sequences in the reference genome [ 69 ]. Hash table-based aligners uses the seed-and-extend method in combination with additional alignment algorithms [ 68 , 70 , 71 ].…”

Section: Precautions Of Data Output From Sequencing (mentioning)

confidence: 99%

Next Generation Sequencing Technology in the Clinic and Its Challenges

Vestergaard, Oliveira, Høgdall, et al. 2021

Cancers

Data analysis has become a crucial aspect in clinical oncology to interpret output from next-generation sequencing-based testing. NGS being able to resolve billions of sequencing reactions in a few days has consequently increased the demand for tools to handle and analyze such large data sets. Many tools have been developed since the advent of NGS, featuring their own peculiarities. Increased awareness when interpreting alterations in the genome is therefore of utmost importance, as the same data using different tools can provide diverse outcomes. Hence, it is crucial to evaluate and validate bioinformatic pipelines in clinical settings. Moreover, personalized medicine implies treatment targeting efficacy of biological drugs for specific genomic alterations. Here, we focused on different sequencing technologies, features underlying the genome complexity, and bioinformatic tools that can impact the final annotation. Additionally, we discuss the clinical demand and design for implementing NGS.
