Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06 2006
DOI: 10.3115/1610075.1610160

Learning field compatibilities to extract database records from unstructured text

Abstract: Named-entity recognition systems extract entities such as people, organizations, and locations from unstructured text. Rather than extract these mentions in isolation, this paper presents a record extraction system that assembles mentions into records (i.e. database tuples). We construct a probabilistic model of the compatibility between field values, then employ graph partitioning algorithms to cluster fields into cohesive records. We also investigate compatibility functions over sets of fields, rather than s…
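
The pipeline the abstract describes (score how compatible two field values are, then partition the resulting graph into records) can be illustrated with a small sketch. This is not the paper's method: the `compatibility` function below is a hand-written, hypothetical stand-in for the learned probabilistic model, the `(type, value, position)` field layout and the `threshold` parameter are assumptions, and thresholded connected components stand in for the graph partitioning algorithms the paper employs.

```python
from itertools import combinations


def compatibility(a, b):
    """Hypothetical stand-in for a learned model: score in [0, 1]."""
    (type_a, _, pos_a), (type_b, _, pos_b) = a, b
    if type_a == type_b:
        # Assumption: a record holds at most one value per field type.
        return 0.0
    # Assumption: mentions close together in the text tend to share a record.
    return 1.0 / (1.0 + abs(pos_a - pos_b))


def cluster_fields(fields, threshold=0.2):
    """Group fields into records via connected components of the graph
    whose edges are the pairs scoring at least `threshold`."""
    parent = list(range(len(fields)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(fields)), 2):
        if compatibility(fields[i], fields[j]) >= threshold:
            parent[find(i)] = find(j)

    records = {}
    for i, field in enumerate(fields):
        records.setdefault(find(i), []).append(field)
    return list(records.values())


fields = [
    ("name", "Alice", 0),
    ("phone", "555-0100", 1),
    ("name", "Bob", 10),
    ("phone", "555-0199", 11),
]
records = cluster_fields(fields)
```

Because the sketch only looks at pairs, two fields of the same type can still end up in one cluster through a shared neighbour. That is the kind of weakness that motivates scoring compatibility over sets of fields instead of pairs.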

Cited by 18 publications (11 citation statements)
References 16 publications
“…Cross-sentence relation extraction Several relation extraction tasks have benefited from cross-sentence extraction, including MUC fact and event extraction (Swampillai and Stevenson, 2011), record extraction from web pages (Wick et al, 2006), extraction of facts for biomedical domains (Yoshikawa et al, 2011), and extensions of semantic role labeling to cover implicit inter-sentential arguments (Gerber and Chai, 2010). These prior works have either relied on explicit co-reference annotation, or on the assumption that the whole document refers to a single coherent event, to simplify the problem and reduce the need for powerful representations of multi-sentential contexts of entity mentions.…”
Section: Binary Relation Extraction
Citation type: mentioning
confidence: 99%
“…Being inspired by tagging problems common in bio-informatics and other areas, these approaches traditionally require some form of supervision. Many require an initial seed of correctly segmented records [10], [21], [23], [26], [37], while others require positive and negative examples of valid field/column values as training data [24], [32], sometimes leveraging existing knowledge bases [9], [30] or, again, instance-level redundancy [6], [13].…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…For example, the authors of [15] propose a technique to identify maximal cliques in a graph where attributes are interconnected by pairwise relations, and generalize it to probabilistic cliques, where each binary relation may have a confidence associated with it. The drawbacks of combining binary relations using agglomerative algorithms or the technique used in [15] for record extraction are analyzed in [23]. The authors of [23] propose a modified approach that evaluates the compatibility of a set of attributes.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…The drawbacks of combining binary relations using agglomerative algorithms or the technique used in [15] for record extraction are analyzed in [23]. The authors of [23] propose a modified approach that evaluates the compatibility of a set of attributes. Such a compatibility function is seen to achieve better accuracy in record extraction.…”
Section: Related Work
Citation type: mentioning
confidence: 99%