Computational graph pangenomics: a tutorial on data structures and their applications

Baaijens, Jasmijn A.; Bonizzoni, Paola; Boucher, Christina; Vedova, Gianluca Della; Pirola, Yuri; Rizzi, Raffaella; Sirén, Jouni

doi:10.1007/s11047-022-09882-6

Cited by 28 publications

(30 citation statements)

References 97 publications


“…We denote a matrix of h rows and w columns as M[1..h][1..w]. We let col(M)_j denote the j-th column of M, i.e., the string drawn as col(M)_j = M[1..h][j] = M[1][j]M[2][j]…M[h][j].…”

Section: Definitions (mentioning)

confidence: 99%
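The column notation in the statement above can be illustrated with a minimal Python sketch. The function name `col` mirrors the quoted notation; the list-of-lists matrix layout and the example values are assumptions made only for illustration.

```python
def col(M, j):
    """Return the j-th column of M as a string, with j counted from 1.

    M is assumed to be a list of h rows, each a list of w symbols,
    so col(M, j) spells M[1][j] M[2][j] ... M[h][j] in the quoted notation.
    """
    return "".join(str(row[j - 1]) for row in M)


# Hypothetical 3 x 4 binary matrix (h = 3 rows, w = 4 columns).
M = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
second_column = col(M, 2)  # reads column 2 from top to bottom
```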

“…In the framework of computational pangenomics, the positional BWT (PBWT), which is a method of permuting the elements of each column of a h × w binary matrix M[1..h][1..w], is a key instrument in the compact representation of large haplotypes data sets [9]. Indeed, due to the intrinsic capability of the PBWT of saving space in memorizing haplotype data and even in analyzing large haplotypes panels, it is becoming a relevant data structure for pangenomics (see the recent tutorial on data structures [2]). It is used in relevant computational steps related to haplotype phasing and analysis, such as the matching procedure in reference panels of haplotypes.…”

Section: Introduction (mentioning)

confidence: 99%

“…It is used in relevant computational steps related to haplotype phasing and analysis, such as the matching procedure in reference panels of haplotypes. Moreover, the notion of PBWT has been extended to graph pangenomes representations of haplotypes with the name of Graph positional BWT or GBWT, and it is currently the building block of sequence to graph aligners in the VG toolkit [2].…”

Section: Introduction (mentioning)

confidence: 99%

“…It has been used for genotype imputation [26], and to create a genotype database search method that is privacy-preserving (PBWT-sec) [28]. Novak et al [22] and Sirén et al [29] used the PBWT to encode a graph for haplotype matching (g-PBWT) and graph pangenome indexing [2]. Sanaullah et al [27] replaced all arrays with linked lists to define a dynamic version of the PBWT (d-PBWT).…”

Section: Introduction (mentioning)

confidence: 99%


Compressed Data Structures for Population-Scale Positional Burrows–Wheeler Transforms

Bonizzoni, Boucher, Cozzi, et al. 2022 (preprint, self-citation)

The positional Burrows–Wheeler Transform (PBWT) was presented in 2014 by Durbin as a means to find all maximal haplotype matches in h sequences containing w variation sites in O(hw)-time. This time complexity of finding maximal haplotype matches using the PBWT is a significant improvement over the naïve pattern-matching algorithm that requires O(h²w)-time. Compared to the more famous Burrows-Wheeler Transform (BWT), however, a relatively little amount of attention has been paid to the PBWT. This has resulted in less space-efficient data structures for building and storing the PBWT. Given the increasing size of available haplotype datasets, and the applicability of the PBWT to pangenomics, the time is ripe for identifying efficient data structures that can be constructed for large datasets. Here, we present a comprehensive study of the memory footprint of data structures supporting maximal haplotype matching in conjunction with the PBWT. In particular, we present several data structure components that act as building blocks for constructing six different data structures that store the PBWT in a manner that supports efficiently finding the maximal haplotype matches. We estimate the memory usage of the data structures by bounding the space usage with respect to the input size. In light of this experimental analysis, we implement the solutions that are deemed to be superior with respect to the memory usage and show the performance on haplotype datasets taken from the 1000 Genomes Project data.
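To make the O(hw) claim concrete, the following is a minimal sketch of the uncompressed prefix and divergence arrays that the PBWT maintains column by column, following the standard construction usually attributed to Durbin (2014). It is not one of the compressed data structures studied in this preprint; the function name `pbwt_prefix_and_divergence` and the list-of-lists input are assumptions for illustration.

```python
def pbwt_prefix_and_divergence(haps):
    """Build PBWT prefix and divergence arrays for a binary panel.

    haps: list of h haplotypes, each a list of w values in {0, 1}.
    Returns a list with one (a, d) pair per column k (0-based):
      a lists haplotype indices sorted by their reversed prefixes ending at k;
      d[i] is the start of the longest match ending at column k between
          a[i] and a[i - 1], with d[0] == k + 1 by convention.
    Each column takes one O(h) pass, so the whole build is O(h * w).
    """
    h, w = len(haps), len(haps[0])
    a = list(range(h))   # before any column, keep the input order
    d = [0] * h
    out = []
    for k in range(w):
        a0, a1, d0, d1 = [], [], [], []
        p = q = k + 1    # pending match starts for the 0-bucket and 1-bucket
        for idx, start in zip(a, d):
            p = max(p, start)
            q = max(q, start)
            if haps[idx][k] == 0:
                a0.append(idx)
                d0.append(p)
                p = 0
            else:
                a1.append(idx)
                d1.append(q)
                q = 0
        # Stable partition: haplotypes with 0 at column k come first.
        a, d = a0 + a1, d0 + d1
        out.append((a, d))
    return out


# Hypothetical toy panel: h = 5 haplotypes, w = 4 sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]
arrays = pbwt_prefix_and_divergence(panel)
```

Haplotypes that are adjacent in `a` share the longest matches ending at the current column, and `d` records where those matches begin; matching algorithms scan the two arrays together rather than comparing all pairs of haplotypes.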


“…The latest version of industry-standard DRAGEN software by Illumina now uses a pangenome graph for mapping reads in highly polymorphic regions of a human genome [13]. For surveys of the recent algorithmic developments in this area, see [2,6,10,35]. Among the many computational tasks associated with pangenome graphs, sequence-to-graph alignment remains a core computational problem.…”

Section: Introduction (mentioning)

confidence: 99%

Sequence to graph alignment using gap-sensitive co-linear chaining

Chandra, Jain 2022 (preprint)

Co-linear chaining is a widely used technique in sequence alignment tools that follow seed-filter-extend methodology. It is a mathematically rigorous approach to combine small exact matches. For co-linear chaining between two sequences, efficient subquadratic-time chaining algorithms are well-known for linear, concave and convex gap cost functions [Eppstein et al. JACM'92]. However, developing extensions of chaining algorithms for DAGs (directed acyclic graphs) has been challenging. Recently, a new sparse dynamic programming framework was introduced that exploits small path cover of pangenome reference DAGs, and enables efficient chaining [Makinen et al. TALG'19, RECOMB'18]. However, the underlying problem formulation did not consider gap cost which makes chaining less effective in practice. To address this, we develop novel problem formulations and optimal chaining algorithms that support a variety of gap cost functions. We demonstrate empirically the ability of our provably-good chaining implementation to align long reads more precisely in comparison to existing aligners. For mapping simulated long reads from human genome to a pangenome DAG of 95 human haplotypes, we achieve 98.7% precision while leaving < 2% reads unmapped. Implementation: https://github.com/at-cg/minichain
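As a rough illustration of what chaining with a gap cost means, here is a quadratic-time dynamic program for two linear sequences. It is not the sparse, path-cover-based DAG algorithm of this preprint. The anchor format `(x, y, length)`, the function name `chain_anchors`, and the particular gap cost (a penalty times the difference between the two gap lengths) are all assumptions chosen for simplicity.

```python
def chain_anchors(anchors, gap_penalty=1.0):
    """Pick a best-scoring co-linear chain of exact-match anchors.

    anchors: list of (x, y, length) with length >= 1, meaning
             query[x:x+length] == reference[y:y+length].
    A chain must be strictly ordered, without overlap, in both coordinates.
    Score = total anchor length - gap_penalty * |gap_x - gap_y| per link.
    Returns (score, chain). Runs in O(n^2) time for n anchors.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    best = [float(length) for _, _, length in anchors]  # best chain ending at j
    prev = [-1] * n                                     # backpointers
    for j in range(n):
        xj, yj, lj = anchors[j]
        for i in range(j):
            xi, yi, li = anchors[i]
            gap_x = xj - (xi + li)
            gap_y = yj - (yi + li)
            if gap_x < 0 or gap_y < 0:
                continue  # anchor i does not precede anchor j in both sequences
            cand = best[i] + lj - gap_penalty * abs(gap_x - gap_y)
            if cand > best[j]:
                best[j], prev[j] = cand, i
    end = max(range(n), key=best.__getitem__)
    score = best[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return score, chain[::-1]


# Hypothetical anchors: three lie on one diagonal, one is far off it.
score, chain = chain_anchors([(0, 0, 5), (6, 6, 4), (7, 20, 10), (12, 12, 3)])
```

In this toy input the off-diagonal anchor is the longest single match, yet linking it to its only possible predecessor costs more than it gains, so the three shorter anchors on one diagonal form the best chain. A formulation without a gap cost could not tell these cases apart.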


Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

Bonizzoni, Felice, Pirola, et al. 2022. Lecture Notes in Computer Science (self-citation)

Graph pangenomics is a new emerging field in computational biology that is changing the traditional view of a reference genome from a linear sequence to a new paradigm: a sequence graph (pangenome graph or simply pangenome) that represents the main similarities and differences in multiple evolutionary related genomes. The speed in producing large amounts of genome data, driven by advances in sequencing technologies, is far from the slow progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field are still based on notions rooted in established and quite old literature on combinatorics on words, formal languages and space efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion is related to variants of the Lyndon factorization and allows to represent sequence similarities for a sample of reads, while the second one is that of sample specific string as a tool to detect differences in a sample of reads. New perspectives opened by these two notions are discussed.
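For readers unfamiliar with the first notion, the classical Lyndon factorization splits a string into a non-increasing sequence of Lyndon words (strings strictly smaller than all of their proper rotations). The sketch below uses Duval's linear-time algorithm for the standard factorization only; the variants discussed in the paper are not reproduced here, and the function name is an assumption.

```python
def lyndon_factorization(s):
    """Return the standard Lyndon factorization of s using Duval's algorithm.

    The factors are Lyndon words w1 >= w2 >= ... >= wm (lexicographically)
    whose concatenation is s. Runs in O(len(s)) time.
    """
    factors = []
    n = len(s)
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j] stays a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i      # the whole stretch s[i:j+1] is itself a Lyndon word
            else:
                k += 1     # still repeating the current period
            j += 1
        # Emit every full copy of the period of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


factors = lyndon_factorization("banana")
```

Because the factorization is unique, the sequence of factors acts as a deterministic signature of a string.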
