In this paper we present a novel architecture for high-speed and high-capacity regular expression matching (REM) on FPGA. The proposed REM architecture, based on nondeterministic finite automaton (RE-NFA), efficiently constructs regular expression matching engines (REME) of arbitrary regular patterns and character classes in a uniform structure, utilizing both logic slices and block memory (BRAM) available on modern FPGA devices. The resulting circuits take advantage of synthesis and routing optimizations to achieve high operating speed and area efficiency. The uniform structure of our RE-NFA design can be stacked in a simple way to produce multi-character input circuits to scale up throughput further. An n-state m-character input REME takes only O (n × log 2 m) time to construct and occupies no more than O (n × m) logic units. The REMEs can be staged and pipelined in large numbers to achieve high parallelism without sacrificing clock frequency.Using the proposed RE-NFA architecture, we are able to implement 3 copies of two-character input REMEs, each with 760 regular expressions, 18715 states and 371 character classes, onto a single Xilinx Virtex 4 LX-100-12 device. Each copy processes 2 characters per clock cycle at 300 MHz, resulting in a concurrent throughput of 14.4 Gbps for 760 REMEs. Compared with the automatic NFA-to-VHDL REME compilation [13], our approach achieves over 9x throughput efficiency (Gbps*state/LUT). Compared with state-of-the-art REMEs on FPGA, our approach also indicates up to 70% better throughput efficiency.
Multi-field packet classification is a key enabling function of a variety of network applications, such as firewall processing, Quality of Service differentiation, traffic billing, and other value added services. Although a plethora of research has been done in this area, wire-speed packet classification while supporting large rule sets remains difficult. This paper exploits the features provided by current FPGAs and proposes a decision-tree-based, two-dimensional dual-pipeline architecture for multi-field packet classification. To fit the current largest rule set in the on-chip memory of the FPGA device, we propose several optimization techniques for the stateof-the-art decision-tree-based algorithm, so that the memory requirement is almost linear with the number of rules. Specialized logic is developed to support varying number of branches at each decision tree node. A tree-to-pipeline mapping scheme is carefully designed to maximize the memory utilization. Since our architecture is linear and memorybased, on-the-fly update without disturbing the ongoing operations is feasible. The implementation results show that our architecture can store 10K real-life rules in on-chip memory of a single Xilinx Virtex-5 FPGA, and sustain 80 Gbps (i.e. 2× OC-768 rate) throughput for minimum size (40 bytes) packets. To the best of our knowledge, this work is the first FPGA-based packet classification engine that achieves wire-speed throughput while supporting 10K unique rules.
Multi-match packet classification is a critical function in network intrusion detection systems (NIDS), where all matching rules for a packet need to be reported. Most of the previous work is based on ternary content addressable memories (TCAMs) which are expensive and are not scalable with respect to clock rate, power consumption, and circuit area. This paper studies the characteristics of real-life Snort NIDS rule sets, and proposes a novel SRAM-based architecture. The proposed architecture is called field-split parallel bit vector (FSBV) where some header fields of a packet are further split into bit-level subfields. Unlike previous multi-match packet classification algorithms which suffer from memory explosion, the memory requirement of FSBV is linear in the number of rules. FPGA technology is exploited to provide high throughput and to support dynamic updates. Implementation results show that our architecture can store on a single Xilinx Virtex-5 FPGA the full set of packet header rules extracted from the latest Snort NIDS and sustains 100 Gbps throughput for minimum size (40 bytes) packets. The design achieves 1.25× improvement in throughput while the power consumption is approximately one fourth that of the state-of-the-art solutions.
Rapid growth in network link rates poses a strong demand on high speed IP lookup engines. Trie-based architectures are natural candidates for pipelined implementation to provide high throughput. However, simply mapping a trie level onto a pipeline stage results in unbalanced memory distribution over different stages. To address this problem, several novel pipelined architectures have been proposed. But their non-linear pipeline structure results in some new performance issues such as throughput degradation and delay variation. In this paper, we propose a simple and effective linear pipeline architecture for trie-based IP lookup. Our architecture achieves evenly distributed memory while realizing high throughput of one lookup per clock cycle. It offers more freedom in mapping trie nodes to pipeline stages by supporting nops. We implement our design as well as the state-of-the-art solutions on a commodity FPGA and evaluate their performance. Post place and route results show that our design can achieve a throughput of 80 Gbps, up to twice the throughput of reference solutions. It has constant delay, maintains input order, and supports incremental route updates without disrupting the ongoing IP lookup operations.
Abstract-Continuous growth in network link rates poses a strong demand on high speed IP lookup engines. While Ternary Content Addressable Memory (TCAM) based solutions serve most of today's high-end routers, they do not scale well for the next-generation [1]. On the other hand, pipelined SRAMbased algorithmic solutions become attractive. Intuitively multiple pipelines can be utilized in parallel to have a multiplicative effect on the throughput. However, several challenges must be addressed for such solutions to realize high throughput. First, the memory distribution across different stages of each pipeline as well as across different pipelines must be balanced. Second, the traffic on various pipelines should be balanced.In this paper, we propose a parallel SRAM-based multipipeline architecture for terabit IP lookup. To balance the memory requirement over the stages, a two-level mapping scheme is presented. By trie partitioning and subtrie-to-pipeline mapping, we ensure that each pipeline contains approximately equal number of trie nodes. Then, within each pipeline, a fine-grained node-to-stage mapping is used to achieve evenly distributed memory across the stages. To balance the traffic on different pipelines, both pipelined prefix caching and dynamic subtrie-topipeline remapping are employed. Simulation using real-life data shows that the proposed architecture with 8 pipelines can store a core routing table with over 200K unique routing prefixes using 3.5 MB of memory. It achieves a throughput of up to 3.2 billion packets per second, i.e. 1 Tbps for minimum size (40 bytes) packets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
334 Leonard St
Brooklyn, NY 11211
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.