Learning field compatibilities to extract database records from unstructured text

Wick, Michael; Culotta, Aron; McCallum, Andrew

doi:10.3115/1610075.1610160

Cited by 18 publications

(11 citation statements)

References 16 publications


“…Cross-sentence relation extraction. Several relation extraction tasks have benefited from cross-sentence extraction, including MUC fact and event extraction (Swampillai and Stevenson, 2011), record extraction from web pages (Wick et al., 2006), extraction of facts for biomedical domains (Yoshikawa et al., 2011), and extensions of semantic role labeling to cover implicit inter-sentential arguments (Gerber and Chai, 2010). These prior works have either relied on explicit co-reference annotation, or on the assumption that the whole document refers to a single coherent event, to simplify the problem and reduce the need for powerful representations of multi-sentential contexts of entity mentions.…”

Section: Binary Relation Extraction (mentioning)

confidence: 99%

Cross-Sentence N-ary Relation Extraction with Graph LSTMs

Peng

Poon

Quirk

et al. 2017

TACL

446

382


“…Being inspired by tagging problems common in bio-informatics and other areas, these approaches traditionally require some form of supervision. Many require an initial seed of correctly segmented records [10], [21], [23], [26], [37], while others require positive and negative examples of valid field/column values as training data [24], [32], sometimes leveraging existing knowledge bases [9], [30] or, again, instance-level redundancy [6], [13].…”

Section: Related Work (mentioning)

confidence: 99%

Joint repairs for web wrappers

Ortona

Orsi

Furche

et al. 2016

2016 IEEE 32nd International Conference on Data Engineering (ICDE)


Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under/over segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort and thus jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge, but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites of a wide variety of application domains shows that joint repairs are able to increase the quality of wrappers between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.


“…For example, the authors of [15] propose a technique to identify maximal cliques in a graph where attributes are interconnected by pairwise relations, and generalize it to probabilistic cliques, where each binary relation may have a confidence associated with it. The drawbacks of combining binary relations using agglomerative algorithms or the technique used in [15] for record extraction are analyzed in [23]. The authors of [23] propose a modified approach that evaluates the compatibility of a set of attributes.…”

Section: Related Work (mentioning)

confidence: 99%

“…The drawbacks of combining binary relations using agglomerative algorithms or the technique used in [15] for record extraction are analyzed in [23]. The authors of [23] propose a modified approach that evaluates the compatibility of a set of attributes. Such a compatibility function is seen to achieve better accuracy in record extraction.…”

Section: Related Work (mentioning)

confidence: 99%

Exploiting evidence from unstructured data to enhance master data management

et al. 2012


Master data management (MDM) integrates data from multiple structured data sources and builds a consolidated 360-degree view of business entities such as customers and products. Today's MDM systems are not prepared to integrate information from unstructured data sources, such as news reports, emails, call-center transcripts, and chat logs. However, those unstructured data sources may contain valuable information about the same entities known to MDM from the structured data sources. Integrating information from unstructured data into MDM is challenging as textual references to existing MDM entities are often incomplete and imprecise and the additional entity information extracted from text should not impact the trustworthiness of MDM data. In this paper, we present an architecture for making MDM text-aware and showcase its implementation as IBM InfoSphere MDM Extension for Unstructured Text Correlation, an add-on to IBM InfoSphere Master Data Management Standard Edition. We highlight how MDM benefits from additional evidence found in documents when doing entity resolution and relationship discovery. We experimentally demonstrate the feasibility of integrating information from unstructured data sources into MDM.

