SummaryIn this paper we study the performance of POSIX submatch extraction algorithms based on nondeterministic finite automata (NFA). We propose an algorithm that combines Laurikari tagged NFA and extended Okui‐Suzuki disambiguation. The algorithm works in worst‐case O(n m2 t) time and O(m2) space (including preprocessing), where n is the length of input, m is the size of the regular expression with bounded repetition expanded and t is the number of capturing groups and subexpressions that contain them. On real‐world benchmarks our algorithm performs close to the O(n m t) complexity of leftmost‐greedy matching, although on artificial benchmarks it can be significantly slower. We propose a lazy version of the algorithm that runs much faster, but requires O(n m2) space. We show that the Kuklewicz algorithm is slower in practice, and the backward matching algorithm proposed by Cox is incorrect.
This paper extends the work of Laurikari [Lau00] [Lau01] and Kuklewicz [Kuk07] on tagged deterministic finite automata (TDFA) in the context of submatch extraction in regular expressions. The main goal of this work is application of TDFA to lexer generators that optimize for speed of the generated code. I suggest a number of practical improvements to Laurikari algorithm; notably, the use of one-symbol lookahead, which results in significant reduction of tag variables and operations on them. Experimental results confirm that lookahead-aware TDFA are considerably faster and usually smaller than baseline TDFA; and they are reasonably close in speed and size to ordinary DFA used for recognition of regular languages. The proposed algorithm can handle repeated submatch and therefore is applicable to full parsing. Furthermore, I examine the problem of disambiguation in the case of leftmost greedy and POSIX policies. I formalize POSIX disambiguation algorithm suggested by Kuklewicz and show that the resulting TDFA are as efficient as Laurikari TDFA or TDFA that use leftmost greedy disambiguation. All discussed algorithms are implemented in the open source lexer generator RE2C.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.