CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

Flores, Christopher; Figueroa, Rosa L.; Pezoa, Jorge E.; Zeng‐Treitler, Qing

doi:10.1109/access.2020.2972205

Cited by 10 publications

(12 citation statements)

References 37 publications

“…Traditional regular expression generators [10][11][12][13][14][15][16][17] focus on trying all variations to obtain the most suitable pattern and ignore time efficiency. Moreover, these generators are suitable for different tasks.…”

Section: Heuristic Approach: Regexn (mentioning)

confidence: 99%

“…Locascio et al [14] use an LSTM-based sequence to sequence a neural network for specialized domain knowledge. Flores et al [15] develop an algorithm for automatically generating regular expressions from biomedical texts using a coarse-to-fine text aligning method. Cui et al [16] design an efficient novel regular expression based text classifier.…”

Section: Introduction (mentioning)

confidence: 99%

A regular expression generator based on CSS selectors for efficient extraction from HTML pages

2020

Turk J Elec Eng & Comp Sci

Cascading Style Sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. In order to be able to extract data from web pages by using these patterns, a Document Object Model (DOM) tree is constructed by an HTML parser for a web page. The construction process of this tree and the extraction process using this tree increase time and memory costs depending on the number of HTML elements and their hierarchies. For reducing these costs, regular expressions can be considered as a solution. However, preparing regular expression patterns is a laborious task. In this study, a heuristic approach, namely REGEXN, that automatically generates these patterns through CSS selectors is introduced and the performance gains are analyzed on a web crawler. The analysis shows that regular expression patterns generated by this approach can significantly reduce the average extraction time results from 743.31 ms to 1.03 ms when compared with the extraction process from a DOM tree. Similarly, the average memory usage drops from 1054.01 bytes to 1.59 bytes. Moreover, REGEXN can be easily adapted to the existing frameworks and tools in this task.
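
The idea in this abstract, matching an element with a regular expression instead of building a DOM tree, can be illustrated with a minimal Python sketch. This is not the REGEXN algorithm, whose details the abstract does not give. The helper name `selector_to_regex` is hypothetical, and the sketch assumes a single `tag`, `tag.class`, or `tag#id` selector, double-quoted attributes, and no nesting of the same tag.

```python
import re


def selector_to_regex(selector: str) -> re.Pattern:
    """Translate 'tag', 'tag.class' or 'tag#id' into a regex (hypothetical helper).

    Assumptions: attributes are double-quoted and the target element does not
    contain a nested element with the same tag name.
    """
    m = re.fullmatch(r"(\w+)(?:([.#])([\w-]+))?", selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, name = m.groups()
    if kind is None:
        # Bare tag: accept any attributes.
        attrs = r"[^>]*"
    elif kind == ".":
        # Class selector: the name must appear as a whole token in class="...".
        attrs = (
            rf'[^>]*\bclass="[^"]*(?<![\w-]){re.escape(name)}(?![\w-])[^"]*"[^>]*'
        )
    else:
        # Id selector: the id attribute must equal the name exactly.
        attrs = rf'[^>]*\bid="{re.escape(name)}"[^>]*'
    # Capture the element's inner text lazily up to the first closing tag.
    return re.compile(rf"<{tag}\b{attrs}>(.*?)</{tag}>", re.DOTALL)


html = (
    '<div><span class="price">12.50</span>'
    '<span class="old price-old">15.00</span>'
    '<span class="sale price">9.99</span></div>'
)
prices = selector_to_regex("span.price").findall(html)
```

Here `prices` holds the inner text of the two `span` elements whose class list contains the token `price`. The pattern scans the raw string once and never builds a tree, which is the kind of saving the abstract reports, at the cost of the robustness a real HTML parser provides.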

“…In our contribution, we boost the performance of the Naïve Bayes (NB) classifier because, despite its strong assumptions of independence among attributes, the NB classifier is a popular algorithm among practitioners. It is particularly effective in text classification tasks [5], [19], [35], [55] and popular among researchers of some specific domains. For instance, recently, Niazi et al [48] use NB to monitor and maintain photovoltaic modules; Shen et al [60] use it to handle dependencies in medical ontologies.…”

Section: Related Work (mentioning)

confidence: 99%

Improving k Nearest Neighbors and Naïve Bayes Classifiers Through Space Transformations and Model Selection

et al. 2020

“…Recently, the AL has attracted the interest of researchers and has been applied to classification algorithms based on DNN [15], [16]. However, to our best knowledge, there are no AL query strategies available for identifying the most informative examples for regular-expressions-based biomedical text classifiers, with only some works related to information extraction tasks but in other usage domains [17]- [21]. Based on the above, in this paper we aim to address the following research questions:…”

Section: Introduction (mentioning)

confidence: 99%

“…The conservative AL query strategy assesses the amount of diversity in the examples through the Smith-Waterman (SW) algorithm to provide a level of uncertainty in cases where regular expressions mismatch. Three datasets written in Spanish were used to evaluate whether the AL decision function effectively achieves the same classification performance when used in conjunction with the Classifier Regular Expression (CREGEX) biomedical text discriminant [21]. Such datasets were obtained from the hospital Guillermo Grant Benavente (HGGB) in Concepción, Chile.…”

Section: Introduction (mentioning)

confidence: 99%

Active Learning for Biomedical Text Classification Based on Automatically Generated Regular Expressions

2021

Self Cite

Biomedical text classification algorithms, which currently support clinical decision-making processes, call for expensive training texts due to the low availability of labeled corpus and the cost of manual annotation by specialized professionals. The active learning (AL) approach to classification heavily lessens such cost by reducing the number of labeled documents required to achieve specified performance. This article introduces a query strategy and a stopping criterion that transform CREGEX, a regular-expressions-based text classification algorithm, in an AL biomedical text classifier. The query strategy samples the training dataset, trading off the greedy learning achieved by the regular expressions classification precision and the conservative learning induced by text sequence alignment classification. The sustained reduction in the variance of the query strategy scores is used as a stopping criterion. The AL classifier was compared with Support Vector Machine (SVM), Naïve Bayes (NB), and a classifier based on Bidirectional Encoder Representations from Transformers (BERT), using three datasets with biomedical information in Spanish on smoking habits, obesity, and obesity types. The learning curve results indicate that AL in CREGEX allowed to efficiently reduce the number of training examples for equal performance than the rest of the classifiers, obtaining areas under the learning curve greater than 85% in all cases. The stopping criterion applied to the AL process allowed to use, on average, approximately 32% to 50% of the total training examples with differences in performance concerning the maximum value of the learning curve not exceeding 2%. This performance demonstrates the effectiveness of using AL in a biomedical text classifier based on regular expressions, which is attributable to such expressions' ability to represent intricate sequential patterns in training texts considered most informative.
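
The stopping criterion described in this abstract, a sustained reduction in the variance of the query strategy scores, could be read as follows. This is only one possible interpretation: the abstract does not say how "sustained" is measured, so the function name `should_stop` and the `patience` parameter are assumptions made for illustration.

```python
from statistics import pvariance


def should_stop(score_history, patience=3):
    """Return True when score variance fell in each of the last `patience` rounds.

    score_history: one list of query-strategy scores per active-learning round
    (hypothetical representation). `patience` is an assumed parameter, not a
    value taken from the paper.
    """
    variances = [pvariance(scores) for scores in score_history]
    if len(variances) <= patience:
        # Not enough rounds yet to observe `patience` consecutive decreases.
        return False
    recent = variances[-(patience + 1):]
    # Every round in the window must have strictly lower variance than the previous one.
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Four rounds whose score spread shrinks each time.
history = [[0.0, 1.0], [0.0, 0.8], [0.0, 0.6], [0.0, 0.4]]
stop_now = should_stop(history, patience=3)
```

With the shrinking spread above, `stop_now` is true after the fourth round. A single round in which the variance rises resets the window, so labeling continues while the scores are still unstable.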
