SemRegex: A Semantics-Based Approach for Generating Regular Expressions from Natural Language Specifications

Zhong, Zexuan; Guo, Jiaqi; Yang, Wei; Peng, Jian; Xie, Tao; Lou, Jian; Liu, Ting; Zhang, Dongmei

doi:10.18653/v1/d18-1189

Cited by 29 publications

(36 citation statements)

References 23 publications

(30 reference statements)

“…Instead of using input-output examples, there are other approaches that synthesize regexes solely from natural language [9,12,27]. We see these approaches as orthogonal to ours and expect that Forest can be improved by hints provided by a natural language component such as was done in Regel.…”

Section: Related Work (mentioning)

confidence: 90%

“…Form validations often rely on complex regexes which require programming skills that not all users possess. To help users write regexes, prior work has proposed to synthesize regular expressions from natural language [1,9,12,27] or from positive and negative examples [1,7,10,26]. Even though these techniques assist users in writing regexes for search and replace operations, they do not specifically target digital form validation and do not take advantage of the structured format of the data.…”

Section: Introduction (mentioning)

confidence: 99%

FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions

Ferreira, Terra-Neves, Ventura, et al. 2021

Tools and Algorithms for the Construction and Analysis of Systems

Form validators based on regular expressions are often used on digital forms to prevent users from inserting data in the wrong format. However, writing these validators can pose a challenge to some users. We present Forest, a regular expression synthesizer for digital form validations. Forest produces a regular expression that matches the desired pattern for the input values and a set of conditions over capturing groups that ensure the validity of integer values in the input. Our synthesis procedure is based on enumerative search and uses a Satisfiability Modulo Theories (SMT) solver to explore and prune the search space. We propose a novel representation for regular expressions synthesis, multi-tree, which induces patterns in the examples and uses them to split the problem through a divide-and-conquer approach. We also present a new SMT encoding to synthesize capture conditions for a given regular expression. To increase confidence in the synthesized regular expression, we implement user interaction based on distinguishing inputs. We evaluated Forest on real-world form-validation instances using regular expressions. Experimental results show that Forest successfully returns the desired regular expression in 70% of the instances and outperforms Regel, a state-of-the-art regular expression synthesizer.
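
To make the idea of "a regular expression plus conditions over capturing groups" concrete, here is a minimal sketch of what such a validator could look like once synthesized. It is not Forest's output or algorithm: the date pattern, the bounds, and the names `DATE_PATTERN`, `CAPTURE_BOUNDS`, and `is_valid` are all assumptions chosen for illustration.

```python
import re

# Hypothetical validator for dd/mm/yyyy dates; pattern and bounds are
# illustrative assumptions, not taken from the Forest paper.
DATE_PATTERN = re.compile(r"(\d{2})/(\d{2})/(\d{4})")

# One inclusive (low, high) bound per capturing group, in group order.
CAPTURE_BOUNDS = [(1, 31), (1, 12), (1900, 2100)]


def is_valid(value: str) -> bool:
    """Accept a value only if it matches the pattern and every captured
    integer lies within the bound for its group."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    return all(
        low <= int(group) <= high
        for group, (low, high) in zip(match.groups(), CAPTURE_BOUNDS)
    )
```

The regular expression alone would accept `32/08/2014`; the capture conditions are what reject it.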

“…The average accuracy of 10 evaluations is given. The distinguishing test cases method is based on the membership test of samples for the case when an oracle is not available and is described in (Zhong et al, 2018a). The accuracy of SoftRegex is similar to or better than SemRegex (Oracle) and always better than Deep Regex and SemRegex (Distinguishing Test Cases).…”

Section: Model Performance (mentioning)

confidence: 99%
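
The membership-based check mentioned in this statement can be sketched as follows. This is only an approximation of the idea under stated assumptions: the function name `agrees_on_samples` is hypothetical, the sample strings are supplied by the caller, and Python's `re` syntax stands in for the regex syntax used in the cited datasets.

```python
import re


def agrees_on_samples(predicted: str, positives: list[str], negatives: list[str]) -> bool:
    """Return True if the predicted regex accepts every positive sample
    and rejects every negative sample (whole-string membership)."""
    pattern = re.compile(predicted)
    accepts_all = all(pattern.fullmatch(s) is not None for s in positives)
    rejects_all = all(pattern.fullmatch(s) is None for s in negatives)
    return accepts_all and rejects_all
```

A check like this needs no equivalence oracle, but it can only be as discriminating as the samples it is given.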

“…Recently, Locascio et al (2016) designed the Deep-Regex model based on the sequence-to-sequence (Seq2Seq) model (Sutskever et al, 2014) using minimal domain knowledge during the learning phase while still accurately predicting regular expressions from NLs. Later, Zhong et al (2018a) improved the performance by training on not only syntactic content of the expressions (i.e. the exact textual representation of the expression that was used), but also the semantic content (the regular language described by the expression).…”

Section: Introduction (mentioning)

confidence: 99%

“…the exact textual representation of the expression that was used), but also the semantic content (the regular language described by the expression). However, the reward function in the SemRegex model (Zhong et al, 2018a) that determines if the predicted regular expression is semantically equivalent to the ground truth expression is known to be PSPACE-complete and is a bottleneck in practice (Stockmeyer and Meyer, 1973). Thus, if we can solve this problem (even approximately) more quickly, then we can decrease the required learning time in the natural-language-to-regular expression (NL-RX) model.…”

Section: Introduction (mentioning)

confidence: 99%
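
The gap between syntactic and semantic equivalence, and why checking the latter is costly, can be illustrated with a bounded brute-force search. This is a sketch, not the procedure used by SemRegex or SoftRegex: the function name `find_distinguishing_string`, the alphabet, and the length bound are assumptions, and the search only establishes agreement up to `max_len`, not full equivalence.

```python
import re
from itertools import product
from typing import Optional


def find_distinguishing_string(
    regex_a: str, regex_b: str, alphabet: str, max_len: int
) -> Optional[str]:
    """Return the first string (shortest first) that exactly one of the
    two regexes accepts, or None if they agree on all strings over
    `alphabet` of length at most `max_len`."""
    pat_a, pat_b = re.compile(regex_a), re.compile(regex_b)
    for length in range(max_len + 1):
        # The number of candidates grows as len(alphabet) ** length.
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            in_a = pat_a.fullmatch(candidate) is not None
            in_b = pat_b.fullmatch(candidate) is not None
            if in_a != in_b:
                return candidate
    return None
```

For example, `(a|b)*` and `[ab]*` differ as text yet the search finds no separating string over `"ab"`, while `a*` and `a?` are separated by `"aa"`. The exponential growth in candidates hints at why an exact check is a training bottleneck.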

SoftRegex: Generating Regex from Natural Language Descriptions using Softened Regex Equivalence

Park, Ko, Cognetta, et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

We continue the study of generating semantically correct regular expressions from natural language descriptions (NL). The current state-of-the-art model, SemRegex, produces regular expressions from NLs by rewarding the reinforced learning based on the semantic (rather than syntactic) equivalence between two regular expressions. Since the regular expression equivalence problem is PSPACE-complete, we introduce the EQ Reg model for computing the similarity of two regular expressions using deep neural networks. Our EQ Reg model essentially softens the equivalence of two regular expressions when used as a reward function. We then propose a new regex generation model, SoftRegex, using the EQ Reg model, and empirically demonstrate that SoftRegex substantially reduces the training time (by a factor of at least 3.6) and produces state-of-the-art results on three benchmark datasets.

Demystifying regular expression bugs

Wang, Brown, Jennings, et al. 2021

Empir Software Eng

Regular expressions cause string-related bugs and open security vulnerabilities for DOS attacks. However, beyond ReDoS (Regular expression Denial of Service), little is known about the extent to which regular expression issues affect software development and how these issues are addressed in practice. We conduct an empirical study of 356 regex-related bugs from merged pull requests in Apache, Mozilla, Facebook, and Google GitHub repositories. We identify and classify the nature of the regular expression problems, the fixes, and the related changes in the test code. The most important findings in this paper are as follows: 1) incorrect regular expression semantics is the dominant root cause of regular expression bugs (165/356, 46.3%). The remaining root causes are incorrect API usage (9.3%) and other code issues that require regular expression changes in the fix (29.5%), 2) fixing regular expression bugs is nontrivial as it takes more time and more lines of code to fix them compared to the general pull requests, 3) most (51%) of the regex-related pull requests do not contain test code changes. Certain regex bug types (e.g., compile error, performance issues, regex representation) are less likely to include test code changes than others, and 4) the dominant type of test code changes in regex-related pull requests is test case addition (75%). The results of this study contribute to a broader understanding of the practical problems faced by developers when using, fixing, and testing regular expressions.

Keywords: Regular expression bug characteristics • Pull requests • Bug fixes • Test code
