Proceedings of the 14th ACM International Conference on Information and Knowledge Management 2005
DOI: 10.1145/1099554.1099678

Predicting accuracy of extracting information from unstructured text collections

Abstract: Exploiting lexical and semantic relationships in large unstructured text collections can significantly enhance managing, integrating, and querying information locked in unstructured text. Most notably, named entities and relations between entities are crucial for effective question answering and other information retrieval and knowledge management tasks. Unfortunately, the success in extracting these relationships can vary for different domains, languages, and document collections. Predicting extraction perfor…

Cited by 28 publications (32 citation statements)
References 23 publications
“…Resource selection approaches generally consist of two steps: (1) build a compact, representative collection summary (e.g., consisting of word frequency vectors [11,15] or document samples [30,32]); (2) relevance estimation: to process a given query, use the collection descriptors to estimate the number of topically relevant documents in each collection, and rank the collections accordingly. Unlike in distributed IR, our IE scenario requires that we identify collections with useful documents for the IE task, rather than collections with documents that are topically relevant to a given query.…”
Section: Problem Definition
confidence: 99%
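The two-step resource selection scheme described in the quote above can be sketched as follows. This is a minimal illustration only, assuming word-frequency vectors as the collection summaries and a smoothed unigram query-likelihood score; the collection names and data are hypothetical, not from the cited work:

```python
from collections import Counter
from math import log

# Step 1: compact collection summaries as word-frequency vectors
# (hypothetical collections for illustration).
summaries = {
    "newswire": Counter("company acquired merger ceo company".split()),
    "biomed":   Counter("protein gene expression protein cell".split()),
}

def score(query, summary, mu=1.0):
    """Unigram query log-likelihood under an additively smoothed model."""
    total = sum(summary.values())
    vocab = len(summary)
    return sum(
        log((summary[w] + mu) / (total + mu * vocab))
        for w in query.split()
    )

def rank_collections(query):
    """Step 2: rank collections by estimated relevance to the query."""
    return sorted(summaries, key=lambda c: score(query, summaries[c]), reverse=True)

print(rank_collections("company merger"))  # ['newswire', 'biomed']
```

As the quote notes, the IE setting would replace the query-relevance score with an estimate of how many documents are *useful for the extraction task*, which this sketch does not attempt.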
“…Earlier efforts to identify collections for an extraction task (e.g., [1,21]) have focused on examining the quality of the extraction output, rather than its volume. The (complementary) methods described in this paper can be adapted to consider quality (see Section 6).…”
Section: Problem Definition
confidence: 99%
“…A related effort to this paper is [1], which presents an approach to examining the quality of a relation that could be generated by an extraction system over a text database. Specifically, [1] builds language models for a text database and compares them against those for an extraction system to examine the relation quality.…”
Section: Related Work
confidence: 99%
“…Specifically, [1] builds language models for a text database and compares them against those for an extraction system to examine the relation quality. Our proposed algorithms are comparatively lightweight in that we eliminate the need for any such (potentially expensive) text analysis or for any a priori database- or extraction-related knowledge.…”
Section: Related Work
confidence: 99%
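The language-model comparison that the quotes attribute to [1] can be sketched as building a unigram model of the database and one of the extraction output, then comparing them, e.g. with KL divergence. This is a hedged illustration under assumed choices (unigram models, additive smoothing, KL as the comparison measure), not the cited paper's exact formulation:

```python
from collections import Counter
from math import log

def unigram_lm(text, vocab, eps=1e-6):
    """Smoothed unigram distribution over a fixed shared vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def kl_divergence(p, q):
    """KL(p || q): how poorly q models samples drawn from p."""
    return sum(p[w] * log(p[w] / q[w]) for w in p)

# Hypothetical texts for illustration only.
database = "acme corp was acquired by widget inc in 2004"
extraction_output = "acme corp acquired widget inc"
vocab = set((database + " " + extraction_output).split())

p = unigram_lm(extraction_output, vocab)
q = unigram_lm(database, vocab)

# A small divergence suggests the extraction output is well supported by the
# database's language model; a large one would flag questionable quality.
print(kl_divergence(p, q))
```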
“…To discover relations between two named entities, a number of works [2], [3], [22] have proposed methods that identify relations using the context words between the entities. In [23], Agichtein and Cucerzan claimed that relation extraction from text documents is a harder task than named entity recognition. They proposed a general language-modeling method for quantifying the difficulty of information extraction by predicting performance for named entity recognition (locations, organizations, person names, and miscellaneous named entities) and for relation extraction (birth dates, death dates, and invention names).…”
Section: Introduction
confidence: 99%
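The context-word idea mentioned in the quote ([2], [3], [22]) can be illustrated with a minimal sketch that pulls out the tokens lying between two entity mentions. The entity spans are supplied by hand here; in a real pipeline they would come from a named-entity recognizer, and the sentence is purely illustrative:

```python
def context_between(tokens, span_a, span_b):
    """Return the tokens between two entity spans (spans are end-exclusive)."""
    (s1, e1), (s2, e2) = sorted([span_a, span_b])
    return tokens[e1:s2]

tokens = "Thomas Edison invented the phonograph in 1877".split()
# Hypothetical spans: "Thomas Edison" = tokens[0:2], "the phonograph" = tokens[3:5]
print(context_between(tokens, (0, 2), (3, 5)))  # ['invented']
```

Relation-discovery methods of this kind then cluster or classify entity pairs by the context words collected across many sentences.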