Jamshed Khan scite author profile

Patro

2021

Motivation The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information Supplementary data are available at Bioinformatics online.

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

et al. 2022

The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17–23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54–58 h, using considerably more memory.

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

Kokot

Deorowicz

et al. 2021

Preprint

The de Bruijn graph has become a key data structure in modern computational genomics, and of keen interest is its compacted variant. The compacted de Bruijn graph provides a lossless representation of the graph, and it is often considerably more efficient to store and process than its non-compacted counterpart. Construction of the compacted de Bruijn graph resides upstream of many genomic analyses. As the quantity of sequencing data and the number of reference genomes on which to perform these analyses grow rapidly, efficient construction of the compacted graph becomes a computational bottleneck for these tasks.We present Cuttlefish 2, significantly advancing the existing state-of-the-art methods for construction of this graph. On a typical shared-memory machine, it reduces the construction of the compacted de Bruijn graph for 661K bacterial genomes (2.58 Tbp of input reference genomes) from about 4.5 days to 17–23 hours. Similarly on sequencing data, it constructs the graph for a 1.52 Tbp white spruce read set in about 10 hours, while the closest competitor, which also uses considerably more memory, requires 54–58 hours.

Spectrum preserving tilings enable sparse and modular reference indexing

Fan

Pibiri

et al. 2022

Preprint

The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences ℛ so that the position of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of ℛ that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in ℛ. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over a SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index k-mer-sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominates that of the k-mer-to-tile mapping since the former depends on the amount of total sequence while the latter depends on the number of unique k-mers in ℛ. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool: pufferfish2. When indexing 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3G to 34.6G while incurring only a 3.6x slowdown when querying k-mers from a sequenced readset. Availability: pufferfish2 implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2 .

Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

Patro

2020

Preprint

Motivation: The construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory. Availability: Cuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish.