BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

Guidi, Giulia; Ellis, Marquita; Rokhsar, Daniel S.; Yelick, Katherine; Buluç, Aydın

doi:10.1101/464420

Cited by 12 publications

(35 citation statements)

References 35 publications


“…For these reasons, we chose BELLA as the basis for our distributed memory algorithm. The quality produced by diBELLA is at least that of BELLA (see [13] for quality comparisons over data sets also used in this study), and higher when using less restricted sets of seeds than [13].…”

Section: Related Work (mentioning)

confidence: 89%

“…These include the minimum distance between seeds, and the maximum number of seeds to explore per overlap. A discussion of these settings in relation to alignment accuracy versus computational cost is presented in the BELLA analysis [13]. In general, increasing the number of seeds to explore per overlap increases computational cost of the alignment stage (not necessarily linearly), depending on the pairwise alignment kernel employed.…”

Section: Overlap (mentioning)

confidence: 99%
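
The two settings named in the statement above (minimum distance between seeds, maximum number of seeds explored per overlap) can be illustrated with a minimal sketch. This is not diBELLA's code: the function `select_seeds`, its greedy left-to-right policy, and the example values are all assumptions made for illustration.

```python
def select_seeds(hits, min_distance, max_seeds):
    """Greedily pick seeds from shared k-mer hits between two reads.

    hits: iterable of (pos_in_read_a, pos_in_read_b) tuples (hypothetical input).
    min_distance: minimum spacing, in bases on read A, between chosen seeds.
    max_seeds: upper bound on the number of seeds passed on to alignment.
    """
    chosen = []
    for pos_a, pos_b in sorted(hits):
        if len(chosen) >= max_seeds:
            break  # cap the alignment work done for this read pair
        # Keep a hit only if it is far enough from the last chosen seed.
        if not chosen or pos_a - chosen[-1][0] >= min_distance:
            chosen.append((pos_a, pos_b))
    return chosen


# Hypothetical shared k-mer positions for one candidate read pair.
hits = [(0, 100), (5, 105), (1000, 1100), (1010, 1111), (2500, 2600)]
few_seeds = select_seeds(hits, min_distance=1000, max_seeds=2)
more_seeds = select_seeds(hits, min_distance=1000, max_seeds=5)
```

Raising `max_seeds` gives the alignment stage more starting points for each pair, which is the cost and accuracy trade-off the statement refers to.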

“…We introduce diBELLA, the first long-read parallel distributed-memory overlapper and aligner. diBELLA uses the methods in BELLA [13], an accurate and efficient single node overlapper and aligner that takes advantage of the statistical properties of the underlying data, including error rate and read length to efficiently and accurately compute overlaps. BELLA is based on a seed-and-extend approach, common to other aligners [2], which finds read pairs that are likely to overlap using a near-linear time algorithm and then performing alignments on those pairs.…”

Section: Introduction (mentioning)

confidence: 99%


diBELLA

Ellis

Guidi

Buluç

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing


We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [27]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.

“…SpGEMM is a relatively unknown primitive in genomics. Most notably, Besta et al. [3] used SpGEMM to compute similarity between genomes in distributed memory, after the appearance of our preprint [15].…”

Section: Related Work (mentioning)

confidence: 99%

“…The threshold t = 2 kb is derived from the procedure proposed by Heng Li [19] and the ground truth is generated using Minimap2. A description of our evaluation procedure and ground truth generation can be found in the supplementary material of our preprint [15].…”

Section: Experimental Setting (mentioning)

confidence: 99%

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

Guidi

Ellis

Rokhsar

et al. 2021

SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21)


Recent advances in long-read sequencing allow characterization of genome structure and its variation within and between species at a resolution not previously possible. Detection of overlap between reads is an essential component of many long read genome pipelines, such as de novo genome assembly. Longer reads simplify genome assembly and improve reconstruction contiguity, but current long read technologies are associated with moderate to high error rates. In this work, we present Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel overlap detection and alignment algorithm using sparse matrix-matrix multiplication. In addition, we present a probabilistic model that demonstrates the feasibility of using k-mers for overlap candidate detection and shows its flexibility when applied to different k-mer selection strategies. Based on such a model, we introduce a notion of reliable k-mers. Combining reliable k-mers with our binning mechanism increases the computational efficiency and accuracy of our algorithm. Finally, we present a new method based on Chernoff bounds to separate true overlaps from false positives by combining alignment techniques and probabilistic modeling. Our goal is to maximize the balance of precision and recall. For both real and synthetic data, BELLA is among the best F1 scores, showing a stability of performance that is often lacking in competing software. BELLA's F1 score is consistently within 1.7% of the top performer. In particular, we show improved de novo assembly quality on synthetic data when BELLA is coupled with the miniasm assembler.
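
As a rough illustration of overlap detection by sparse matrix-matrix multiplication, the sketch below builds a sparse reads-by-k-mers matrix A and counts shared k-mers per read pair, which is the upper triangle of A·Aᵀ. It is not BELLA's implementation. The function names are hypothetical, the fixed frequency window `min_freq`/`max_freq` is only a stand-in for the reliable k-mer bounds that the abstract says come from a probabilistic model, and reverse complements, binning, and alignment are omitted.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def overlap_candidates(reads, k, min_freq=2, max_freq=4, min_shared=1):
    """Count shared k-mers per read pair, i.e. the upper triangle of A * A^T.

    A is the sparse reads-by-k-mers occurrence matrix, stored by column:
    each k-mer maps to the set of read ids that contain it.
    min_freq/max_freq are assumed cutoffs, not values from the paper.
    """
    columns = defaultdict(set)
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            columns[km].add(rid)

    shared = defaultdict(int)
    for rids in columns.values():
        # Skip k-mers seen too rarely (likely errors) or too often (repeats).
        if not (min_freq <= len(rids) <= max_freq):
            continue
        # Each retained column adds 1 to every pair of reads it touches.
        for i, j in combinations(sorted(rids), 2):
            shared[(i, j)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy reads: reads 0 and 1 share the 5-mer "CGGTC"; read 2 shares nothing.
reads = ["ACGTACGGTC", "CGGTCATTGA", "TTTTTTTTTT"]
candidates = overlap_candidates(reads, k=5)
```

Only the read pairs that survive this step would need pairwise alignment, which is what keeps the candidate search from comparing all pairs of reads.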


The parallelism motifs of genomic data analysis

Yelick

Buluç

Awan

et al. 2020

Phil. Trans. R. Soc. A.


Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
