CUDA acceleration of P7Viterbi algorithm in HMMER 3.0

Quirem, Saddam; Ahmed, Fahian; Lee, Byeong Kil

doi:10.1109/pccc.2011.6108104

Cited by 7 publications

(2 citation statements)

References 2 publications


“…The main loop of this function is amenable to vectorization as it exhibits a high degree of data parallelism [42]. Modern ARM processors support the NEON instruction set extension which operate on four scalar values at a time.…”

Section: A Auto-vectorization (mentioning)

confidence: 99%

Leakage-Resilient Layout Randomization for Mobile Devices

Braden, Crane, Davi, et al. 2016

Proceedings 2016 Network and Distributed System Security Symposium


Abstract: Attack techniques based on code reuse continue to enable real-world exploits bypassing all current mitigations. Code randomization defenses greatly improve resilience against code reuse. Unfortunately, sophisticated modern attacks such as JIT-ROP can circumvent randomization by discovering the actual code layout on the target and relocating the attack payload on the fly. Hence, effective code randomization additionally requires that the code layout cannot be leaked to adversaries.

Previous approaches to leakage-resilient diversity have either relied on hardware features that are not available in all processors, particularly resource-limited processors commonly found in mobile devices, or they have had high memory overheads. We introduce a code randomization technique that avoids these limitations and scales down to mobile and embedded devices: Leakage-Resilient Layout Randomization (LR²).

Whereas previous solutions have relied on virtualization, x86 segmentation, or virtual memory support, LR² merely requires the underlying processor to enforce a W⊕X policy, a feature that is virtually ubiquitous in modern processors, including mobile and embedded variants. Our evaluation shows that LR² provides the same security as existing virtualization-based solutions while avoiding design decisions that would prevent deployment on less capable yet equally vulnerable systems. Although we enforce execute-only permissions in software, LR² is as efficient as the best-in-class virtualization-based solution.

I. MOTIVATION

The recent "Stagefright" vulnerability exposed an estimated 950 million Android systems to remote exploitation [21]. Similarly, the "One Class to Rule them All" [40] zero-day vulnerability affected 55% of all Android devices. These are just the most recent incidents in a long series of vulnerabilities that enable attackers to mount code-reuse attacks [37,43] against mobile devices. Moreover, because these devices run scripting capable web browsers, they are also exposed to sophisticated code-reuse attacks that can bypass ASLR and even fine-grained code randomization by exploiting information-leakage vulnerabilities [11,20,48,50]. Just-in-time attacks (JIT-ROP) [50] are particularly challenging because they misuse run-time scripting to analyze the target memory layout after randomization and relocate a return-oriented programming (ROP) payload accordingly.

There are several alternatives to code randomization aimed to defend against code-reuse attacks, including control-flow integrity (CFI) [1] and code-pointer integrity (CPI) [28]. However, these defenses come with their own set of challenges and tend to have high worst-case performance overheads. We focus on code randomization techniques since they are known to be efficient [18,25] and scalable to complex, real-world applications such as web browsers, language runtimes, and operating system kernels without the need to perform elaborate static program analysis during compilation.

Recent code randomization defenses offer varying degrees of resilience to JIT...


“…Different sequences were assigned to individual threads in both methods. Partial optimization was proposed in [19], which parallelizes the P7Viterbi part without considering the D-D path dependency. Although this approach claims a 14x speedup than original functions, it sacrifices the sensitivity of probabilistic inference.…”

Section: Introduction (mentioning)

confidence: 99%

CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU

Jiang, Ganesan 2016

BMC Bioinformatics


Background: HMMER software suite is widely used for analysis of homologous protein and nucleotide sequences with high sensitivity. The latest version of hmmsearch in HMMER 3.x utilizes a heuristic pipeline which consists of the MSV/SSV (Multiple/Single ungapped Segment Viterbi) stage, the P7Viterbi stage and the Forward scoring stage to accelerate homology detection. Since the latest version is highly optimized for performance on modern multi-core CPUs with SSE capabilities, only a few acceleration attempts report speedup. However, the most compute-intensive tasks within the pipeline (viz., MSV/SSV and P7Viterbi stages) still stand to benefit from the computational capabilities of massively parallel processors.

Results: A Multi-Tiered Parallel Framework (CUDAMPF) implemented on CUDA-enabled GPUs, presented here, offers a finer-grained parallelism for MSV/SSV and Viterbi algorithms. We couple the SIMT (Single Instruction Multiple Threads) mechanism with SIMD (Single Instruction Multiple Data) video instructions with warp-synchronism to achieve high-throughput processing and eliminate thread idling. We also propose a hardware-aware optimal allocation scheme of scarce resources like on-chip memory and caches in order to boost performance and scalability of CUDAMPF. In addition, runtime compilation via NVRTC available with CUDA 7.0 is incorporated into the presented framework, which not only helps unroll the innermost loop to yield up to a 2 to 3-fold speedup over static compilation but also enables dynamic loading and switching of kernels depending on the query model size, in order to achieve optimal performance.

Conclusions: CUDAMPF is designed as a hardware-aware parallel framework for accelerating computational hotspots within the hmmsearch pipeline as well as other sequence alignment applications. It achieves significant speedup by exploiting hierarchical parallelism on a single GPU and takes full advantage of limited resources based on their own performance features. In addition to exceeding the performance of other acceleration attempts, comprehensive evaluations against high-end CPUs (Intel i5, i7 and Xeon) show that CUDAMPF yields up to 440 GCUPS for SSV, 277 GCUPS for MSV and 14.3 GCUPS for P7Viterbi, all with 100% accuracy, which translates to a maximum speedup of 37.5, 23.1 and 11.6-fold for MSV, SSV and P7Viterbi respectively. The source code is available at https://github.com/Super-Hippo/CUDAMPF.


Accelerating Search of Protein Sequence Databases using CUDA-Enabled GPU

Cheng, Butler 2015

Database Systems for Advanced Applications

