Why GPUs are Slow at Executing NFAs and How to Make them Faster

Liu, Hongyuan; Pai, Sreepathi; Jog, Adwait

doi:10.1145/3373376.3378471

Cited by 18 publications

(9 citation statements)

References 51 publications

“…Many works aim to exploit these parallelism degrees on top of a massively parallel architecture such as GPUs. We exclude closed-source approaches [17], [36], while selecting the approach proposed by Liu et al [31] which exposes several state-of-the-art GPU-based methodologies.…”

Section: GPU-based Engines (mentioning)

confidence: 99%

“…We select the available engines as described in §II and represented in A of Figure 1. Indeed, YARB's engines currently are: RE2 [30], Hyperscan [14], and the ones presented by Liu et al [31]. RE2 is a C++ general-purpose library that guarantees execution time linear in the input length and fixed stack footprint, able to target any CPU regardless the ISA.…”

Section: A Regular Expression Engines (mentioning)

confidence: 99%

“…Since we adopt REs as universal language, we devise a novel Automata to REs translator that can optionally apply different minimization algorithms and explore solutions' performance. We analyze open-source literature for RE matching on heterogeneous systems and select the ones working on every scenario [14], [30], [31], providing relevant insights. In summary, this work's contributions are:…”

Section: Introduction (mentioning)

confidence: 99%

YARB: a Methodology to Characterize Regular Expression Matching on Heterogeneous Systems

Carloni, Conficconi, Moschetto et al. 2023

2023 IEEE International Symposium on Circuits and Systems (ISCAS)

The continuous growth of data pushes novel and efficient approaches for information retrieval. In this context, Regular Expression (RE) matching is widely employed and represents a relevant computational kernel that carries control- and memory-related issues. Among the several solutions to relieve these burdens, accelerators seem a promising alternative to general-purpose systems. However, state-of-the-art benchmarking presents a highly fragmented scenario without consensus on the approach and lacks an open-source strategy. Therefore, to fairly characterize existing execution engines, this work presents YARB, an open benchmarking methodology. It builds upon literature solutions, a comprehensive approach, and an in-depth characterization of heterogeneous systems. Moreover, YARB's openness will enable future integrations and engines comparison.

“…Our notion of counter-ambiguity is formulated more generally, and our simulation based on bit vectors handles character class ambiguity. Finally, there are several works that implement regex matching algorithms on GPUs [14,29,60,70].…”

Section: Related Work (mentioning)

confidence: 99%

Software-hardware codesign for efficient in-memory regular pattern matching

Kong, Chattopadhyay et al. 2022

Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Regular pattern matching is used in numerous application domains, including text processing, bioinformatics, and network security. Patterns are typically expressed with an extended syntax of regular expressions. This syntax includes the computationally challenging construct of bounded repetition or counting, which describes the repetition of a pattern a fixed number of times. We develop a specialized in-memory hardware architecture that integrates counter and bit vector modules into a state-of-the-art in-memory NFA accelerator. The design is inspired by the theoretical model of nondeterministic counter automata (NCA). A key feature of our approach is that we statically analyze regular expressions to determine bounds on the amount of memory needed for the occurrences of bounded repetition. The results of this analysis are used by a regex-to-hardware compiler in order to make an appropriate selection of counter or bit vector modules. We evaluate our hardware implementation using a simulator based on circuit parameters collected by SPICE simulation in TSMC 28nm CMOS process. We find that the use of counter and bit vector modules outperforms unfolding
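To illustrate why bounded repetition is costly and what a bit vector offers over a single counter, here is a minimal Python sketch. It is not the hardware design described in the abstract. The example pattern `.*a.{n}`, the function name `bounded_repetition_matches`, and the encoding (bit j set means "exactly j symbols have been consumed since some earlier `a`") are assumptions chosen for illustration. Every `a` in the input starts its own count, so several counts can be live at once; one shift of an (n+1)-bit vector advances all of them together.

```python
def bounded_repetition_matches(text: str, n: int) -> list:
    """Return each index i at which a match of .*a.{n} ends,
    i.e. each i with i - n >= 0 and text[i - n] == 'a'.

    Illustrative sketch only: bit j of `vec` records that some 'a'
    was seen exactly j symbols ago.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    mask = (1 << (n + 1)) - 1   # keep only bits 0..n
    accept = 1 << n             # bit n: an 'a' occurred exactly n symbols ago
    vec = 0
    ends = []
    for i, ch in enumerate(text):
        vec = (vec << 1) & mask  # every live count advances by one symbol
        if ch == "a":
            vec |= 1             # start a new count at 0
        if vec & accept:
            ends.append(i)
    return ends


if __name__ == "__main__":
    print(bounded_repetition_matches("xaxxax", 2))
```

Unfolding the same pattern into a plain NFA would need one state per possible count value, whereas the sketch keeps the whole set of live counts in a single integer.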

“…Specialized approaches instead focus on a selected application and exploit the characteristics of it. Works on Non-deterministic finite automaton (NFA) propose to dynamically employ the GPU shared memory to store frequently used sizable lookup tables [91]. Many specialized works have focused on GPU execution of irregular Sparse Matrix Vector Multiplication (SpMV) and Matrix Matrix Multiplication (GEMM) by proposing software approaches that reorder the matrices dataset [119], algorithms tailored for specific data characteristics of the matrices [127], and row reordering techniques [69] to improve data locality among processed rows.…”

Section: Memory Divergence (mentioning)

confidence: 99%

High-performance and energy-efficient irregular graph processing on GPU architectures

Segura Salvador

Graph processing is an established and prominent domain that is the foundation of new emerging applications in areas such as Data Analytics and Machine Learning, empowering applications such as road navigation, social networks and automatic speech recognition. The large amount of data employed in these domains requires high throughput architectures such as GPGPU. Although the processing of large graph-based workloads exhibits a high degree of parallelism, memory access patterns tend to be highly irregular, leading to poor efficiency due to memory divergence. In order to ameliorate these issues, GPGPU graph applications perform stream compaction operations which process active nodes/edges so subsequent steps work on a compacted dataset. We propose to offload this task to the Stream Compaction Unit (SCU) hardware extension tailored to the requirements of these operations, which additionally performs pre-processing by filtering and reordering elements processed. We show that memory divergence inefficiencies prevail in GPGPU irregular graph-based applications, yet we find that it is possible to relax the strict relationship between thread and processed data to empower new optimizations. As such, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension integrated in the GPU pipeline that reorders and filters data processed by the threads on irregular accesses improving memory coalescing. Finally, we leverage the strengths of both previous approaches to achieve synergistic improvements. We do so by proposing the IRU-enhanced SCU (ISCU), which employs the efficient pre-processing mechanisms of the IRU to improve SCU stream compaction efficiency and NoC throughput limitations due to SCU pre-processing operations. We evaluate the ISCU with state-of-the-art graph-based applications achieving a 2.2x performance improvement and 10x energy-efficiency.
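The stream compaction step mentioned in the abstract can be sketched in software as a flag, scan, and scatter sequence. This is a minimal sequential Python sketch of the general technique, not the SCU hardware or any code from the thesis; the names `exclusive_scan` and `stream_compact` and the `is_active` predicate are assumptions made for illustration.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: offsets[i] = sum(flags[:i]); also returns the total."""
    offsets, running = [], 0
    for f in flags:
        offsets.append(running)
        running += f
    return offsets, running


def stream_compact(items, is_active):
    """Keep only the active items, preserving their original order."""
    flags = [1 if is_active(x) else 0 for x in items]
    offsets, total = exclusive_scan(flags)
    out = [None] * total
    # Each iteration writes to a distinct slot, so on a GPU this loop
    # could be mapped to one thread per element.
    for i, x in enumerate(items):
        if flags[i]:
            out[offsets[i]] = x
    return out


if __name__ == "__main__":
    frontier = [5, 0, 7, 0, 9]
    print(stream_compact(frontier, lambda v: v != 0))
```

The scan gives every active element its output slot up front, which is what lets later steps operate on a dense array instead of skipping over inactive nodes or edges.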
