2022
DOI: 10.1101/2022.11.20.517297
Preprint

A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers

Abstract: Nanopore sequencing is a widely-used high-throughput genome sequencing technology that can sequence long fragments of a genome. Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases (i.e., A, C, G, T) using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for every subsequent step in genome analysis. Currently, basecallers are mainly based on deep learning techniques to provide hig…
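For orientation, the sketch below shows (in PyTorch) the general shape of a deep learning basecaller as described in the abstract: a convolutional encoder turns raw current samples into per-timestep log-probabilities over {blank, A, C, G, T}, suitable for CTC decoding. It is an illustrative toy, not the framework proposed in the paper; all layer names and sizes are assumptions.

```python
# Minimal, illustrative sketch of a deep learning basecaller (not the paper's model).
# It maps a window of raw nanopore current samples to per-timestep probabilities
# over {blank, A, C, G, T}, the form typically decoded with CTC.
import torch
import torch.nn as nn

class TinyBasecaller(nn.Module):
    def __init__(self, n_classes=5):  # blank + A, C, G, T
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=3, padding=4),   # downsample raw signal
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, signal):                   # signal: (batch, samples)
        x = self.encoder(signal.unsqueeze(1))    # -> (batch, 128, time)
        x = x.transpose(1, 2)                    # -> (batch, time, 128)
        return self.head(x).log_softmax(-1)      # per-step log-probs for CTC decoding

# Example: one read chunk of 4,000 current samples -> base log-probabilities.
model = TinyBasecaller()
log_probs = model(torch.randn(1, 4000))
```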

Cited by 4 publications (18 citation statements) | References 116 publications (216 reference statements)
“…One of the immediate steps after generating raw nanopore signals is their translation to their corresponding DNA bases as sequences of characters with a computationally intensive step, basecalling. Basecalling approaches are usually computationally costly and consume significant energy as they use complex deep learning models [26][27][28][29][30][31][32][33][34][35][36][37][38]. Although we do not evaluate in this work, we expect that RawHash can be used as a low-cost filter to eliminate the reads that are unlikely to be useful in downstream analysis, which can reduce the overall workload of basecallers and further downstream analysis.…”
Section: Discussion
confidence: 99%
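The pre-basecalling filtering idea in the statement above can be sketched as follows; `maps_to_target` and `basecall` are hypothetical placeholders standing in for a RawHash-style raw-signal check and a deep learning basecaller, not real APIs.

```python
# Illustrative sketch of using a cheap raw-signal filter before basecalling.
# maps_to_target() and basecall() are hypothetical placeholders; RawHash and
# real basecallers expose their own interfaces.
def filtered_basecalling(raw_reads, maps_to_target, basecall):
    """Basecall only reads whose raw signal appears useful for downstream analysis."""
    called = []
    for read in raw_reads:
        if maps_to_target(read.signal):     # cheap raw-signal check (RawHash-style)
            called.append(basecall(read))   # expensive deep learning basecalling
        # else: skip the read, saving basecalling compute and energy
    return called
```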
“…Modern deep learning-based basecallers [20, 21, 36, 38, 39, 48, 5254] incorporate skip connections to help mitigate the vanishing gradient and saturation problems [55]. Removing skip connections has a higher impact on basecalling accuracy.…”
Section: Methods
confidence: 99%
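A minimal PyTorch sketch of the skip (residual) connection the statement refers to: the block's input is added back to its convolutional output, giving gradients an identity path around the transformation, which is why removing such connections tends to hurt training and basecalling accuracy. Channel and kernel sizes are illustrative, not taken from any particular basecaller.

```python
# Minimal residual (skip connection) block, as used in many basecaller encoders.
# The identity path lets gradients flow around the convolutions, mitigating the
# vanishing gradient problem mentioned above. Sizes are illustrative.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=128, kernel_size=9):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):                  # x: (batch, channels, time)
        return self.act(self.body(x) + x)  # skip connection: add the input back

# Without the "+ x" term the block becomes a plain convolutional stack,
# which is the kind of ablation the passage describes as hurting accuracy.
```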
“…However, these deep learning models use millions of model parameters, which makes basecalling computationally expensive. Recent works propose algorithmic optimizations [33, 48] and hardware accelerators [85] to improve the performance of basecallers. These works accelerate the basecalling step without eliminating the wasted computation in basecalling.…”
Section: Related Work
confidence: 99%
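As a rough illustration of the "millions of model parameters" point, the snippet below counts the trainable parameters of a placeholder basecaller-style network; the architecture is an assumption chosen only to show the scale involved, not any model from the paper.

```python
import torch.nn as nn

# Count trainable parameters of a (placeholder) basecaller-style model; production
# basecallers have millions of such parameters, which drives their compute cost.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

demo = nn.Sequential(                      # stand-in for a real basecaller network
    nn.Conv1d(1, 256, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=9, padding=4),
    nn.LSTM(256, 256, num_layers=3),
)
print(f"trainable parameters: {count_parameters(demo):,}")  # on the order of millions
```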