Storing and processing large DNA sequences has always been a major problem because the volume of DNA sequence data keeps increasing. A number of solutions have been proposed, but they require significant computation and memory. An efficient storage and pattern matching solution is therefore required for DNA sequencing data. Bloom filters (BFs) are an efficient data structure that is mostly used in bioinformatics for the classification of DNA sequences. In this paper, we explore uses of BFs beyond classification. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all the locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of a disease. This paper serves as a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. It presents a proof-of-concept implementation of the proposed MBF technique for a given set of data, and we expect that further optimizations of the proposed solution can bring remarkable results. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
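The abstract does not describe how the MBFs are organised, so the following is only a minimal sketch of one plausible layout, not the authors' method. It assumes one Bloom filter per fixed-size block of the sequence, each holding the k-mers that start in that block; a query skips blocks whose filter rules the pattern out and verifies the remaining candidates by direct comparison. The names `BloomFilter`, `build_filters`, `find_pattern`, and the parameters `num_bits`, `num_hashes`, and `block_size` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filters(sequence, k, block_size):
    """Assumed layout: one filter per block, holding k-mers starting there."""
    filters = []
    last_start = len(sequence) - k + 1  # one past the last valid k-mer start
    for start in range(0, len(sequence), block_size):
        bf = BloomFilter()
        for i in range(start, min(start + block_size, last_start)):
            bf.add(sequence[i:i + k])
        filters.append(bf)
    return filters


def find_pattern(sequence, pattern, filters, block_size):
    """Return every start index of pattern; len(pattern) must equal k."""
    k = len(pattern)
    last_start = len(sequence) - k + 1
    locations = []
    for block, bf in enumerate(filters):
        if pattern not in bf:
            continue  # a Bloom filter has no false negatives: safe to skip
        start = block * block_size
        # Candidate block: verify exactly, which also removes false positives.
        for i in range(start, min(start + block_size, last_start)):
            if sequence[i:i + k] == pattern:
                locations.append(i)
    return locations
```

Calling `find_pattern(seq, "ACG", build_filters(seq, 3, 4), 4)` returns the list of start positions, and its length gives the number of repetitions. In this sketch the filters only prune blocks; the reported locations always come from an exact comparison.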
Deep packet inspection (DPI) is one of the crucial tasks in modern intrusion detection and intrusion prevention systems. It allows the packet payload to be inspected using patterns. Modern DPI-based systems use regular expressions to define these patterns. A deterministic finite automaton (DFA) is considered an ideal choice for regular expression matching because of its O(1) processing complexity per input character. However, a DFA needs a large amount of memory to store its state transition table, and the problem worsens as the stride level of the DFA increases. Increasing the stride level significantly speeds up the matching engine, but the tradeoff is higher memory consumption. In this paper, we present stride-k speculative parallel pattern matching (SPPM), a technique in which a packet is first split into two chunks and then multiple bytes per chunk are inspected at a time using a stride-k DFA. Furthermore, we present a stride-k DFA compression technique that uses an alphabet compression table (ACT) to reduce the memory requirements of the stride-k DFA. We have implemented a single-threaded algorithm for stride-2 SPPM. Results show that stride-2 SPPM increases pattern matching speed by up to 30% compared to traditional DFA matching and reduces the number of iterations required for packet processing by over 70%. In addition, using the ACT in the stride-2 DFA implementation reduces the number of transitions by over 65%.
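To make the stride-2 and ACT ideas concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation and leaves out the speculative two-chunk split of SPPM. It assumes the ACT maps every input symbol to an equivalence class, where two symbols share a class if they have identical transitions in every state, and that the stride-2 table is indexed by pairs of classes. The toy DFA (accepts any string containing "ab", with an absorbing accepting state), the four-symbol alphabet, and all function names are hypothetical.

```python
# Toy stride-1 DFA over the alphabet "abcd": accepts strings containing "ab".
# State 2 is accepting and absorbing, so a stride-2 step cannot skip past it.
ALPHABET = "abcd"
EXAMPLE_DFA = [
    {"a": 1, "b": 0, "c": 0, "d": 0},  # state 0: nothing matched yet
    {"a": 1, "b": 2, "c": 0, "d": 0},  # state 1: last symbol was "a"
    {"a": 2, "b": 2, "c": 2, "d": 2},  # state 2: "ab" seen (accepting)
]


def build_act(dfa, alphabet):
    """Map each symbol to a class; identical transition columns share one."""
    column_to_class = {}
    act = {}
    for sym in alphabet:
        column = tuple(row[sym] for row in dfa)
        act[sym] = column_to_class.setdefault(column, len(column_to_class))
    return act, len(column_to_class)


def build_stride2(dfa, act, num_classes, alphabet):
    """Build table[state][class1][class2] by composing two stride-1 steps."""
    representative = {}
    for sym in alphabet:
        representative.setdefault(act[sym], sym)  # any member of the class
    table = []
    for state in range(len(dfa)):
        row = [[0] * num_classes for _ in range(num_classes)]
        for c1 in range(num_classes):
            mid = dfa[state][representative[c1]]
            for c2 in range(num_classes):
                row[c1][c2] = dfa[mid][representative[c2]]
        table.append(row)
    return table


def match_stride2(data, dfa, act, stride2, accepting, start=0):
    """Consume two symbols per iteration; a trailing odd symbol uses stride 1."""
    state = start
    even_length = len(data) - len(data) % 2
    for i in range(0, even_length, 2):
        state = stride2[state][act[data[i]]][act[data[i + 1]]]
    if len(data) % 2:
        state = dfa[state][data[-1]]
    return state in accepting
```

In this toy case "c" and "d" behave identically, so the ACT yields 3 classes instead of 4 symbols, and each state of the stride-2 table needs 3 × 3 = 9 entries rather than 4 × 4 = 16. The loop also runs roughly half as many iterations as a stride-1 matcher on the same input. The figures reported in the abstract come from the authors' own implementation and data, not from this sketch.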