Enabling information extraction by inference of regular expressions from sample entities

Brauer, Falk; Rieger, Robert; Mocan, Adrian; Barczyński, Wojciech M.

doi:10.1145/2063576.2063763

Cited by 48 publications

(53 citation statements)

References 13 publications

“…Each transducer consists of a set of pattern-action rules, the actions being new annotations over the matched text (see Figure 1). The rule-based methods described in Section 2.2 are in principle capable of generating some valid rules for this standard, but they usually employ only character features to generate regular expressions (Li et al 2008; Brauer et al 2011), or only token features, predefined (Soderland 1999; Thompson et al 1999) or not (Ciravegna and Wilks 2003; Nagesh and Chiticariu 2012). There are some approaches like (Wu and Pottenger 2005) that use both types of features, but they cannot be customized.…”

Section: Representation Of Patterns (mentioning)

confidence: 99%

A Semi-automatic and low-cost method to learn patterns for named entity recognition

Marrero, Urbano

2017

Nat. Lang. Eng.

Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.

“…We assess our proposal on several datasets representative of possible applications of our similarity learning method (the name of each dataset describes the nature of the data and the type of the entities to be extracted): HTML-href [14,13,11], Log-MAC+IP [14,13,11], Email-Phone [14,13,11,8,7], Bills-Date [14,12], Web-URL [14,13,11,7], Twitter-URL [14,13,11]. Each dataset consists of a text annotated with all and only the snippets that should be extracted.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

“…Devising a similarity function capable of capturing syntactic patterns is an important problem as it may enable significant improvements in methods for constructing syntax-based entity extractors from examples automatically [4][5][6][7][8][9][10][11][12][13][14]. We are not aware of any similarity definition capable of (approximately) separating strings which adhere to a common syntactic pattern (e.g., telephone numbers, or email addresses) from strings which do not.…”

Section: Introduction and Related Work (mentioning)

confidence: 99%

Syntactical Similarity Learning by Means of Grammatical Evolution

Bartoli, Lorenzo, Medvet, et al.

2016

Parallel Problem Solving From Nature – PPSN XIV

Abstract. Several research efforts have shown that a similarity function synthesized from examples may capture an application-specific similarity criterion in a way that fits the application needs more effectively than a generic distance definition. In this work, we propose a similarity learning algorithm tailored to problems of syntax-based entity extraction from unstructured text streams. The algorithm takes as input pairs of strings along with an indication of whether or not they adhere to the same syntactic pattern. Our approach is based on Grammatical Evolution and systematically explores a similarity definition space including all functions that may be expressed with a specialized, simple language that we have defined for this purpose. We assessed our proposal on patterns representative of practical applications. The results suggest that the proposed approach is indeed feasible and that the learned similarity function is more effective than the Levenshtein distance and the Jaccard similarity index.
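For orientation, the two generic baselines named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' experimental setup: the function names are hypothetical, and computing the Jaccard index over character bigrams is an assumption, since the abstract does not say which string representation was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard index over sets of character n-grams.
    The bigram default (n=2) is an assumption made for this sketch."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Both measures compare characters rather than structure. Two strings that follow the same syntactic pattern but share few characters, such as `555-1234` and `800-9876`, have no bigram in common, so this Jaccard variant rates them as entirely dissimilar. That is the kind of gap a learned similarity function is meant to close.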

“…GATE 6 provides the JAPE language that recognizes regular expressions over annotations. Other systems focus on reducing manual effort for developing extractors (Brauer et al, 2011; Li et al, 2011). In contrast, our tool focuses on visualizing and comparing diagnostic information associated with pattern learning systems.…”

Section: Related Work (mentioning)

confidence: 99%

SPIED: Stanford Pattern based Information Extraction and Diagnostics

Gupta, Manning

2014

Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces

This paper aims to provide an effective interface for progressive refinement of pattern-based information extraction systems. Pattern-based information extraction (IE) systems have an advantage over machine learning based systems that patterns are easy to customize to cope with errors and are interpretable by humans. Building a pattern-based system is usually an iterative process of trying different parameters and thresholds to learn patterns and entities with high precision and recall. Since patterns are interpretable to humans, it is possible to identify sources of errors, such as patterns responsible for extracting incorrect entities and vice-versa, and correct them. However, it involves time consuming manual inspection of the extracted output. We present a light-weight tool, SPIED, to aid IE system developers in learning entities using patterns with bootstrapping, and visualizing the learned entities and patterns with explanations. SPIED is the first publicly available tool to visualize diagnostic information of multiple pattern learning systems to the best of our knowledge.
