Automatic rule refinement for information extraction

Liu, Bin; Chiticariu, Laura; Chu, Vivian S.; Jagadish, H. V.; Reiss, Frederick

doi:10.14778/1920841.1920916

Cited by 31 publications

(41 citation statements)

References 21 publications

“…Specifically, view update naturally arises when debugging Information Extraction (IE) programs, which can be highly complicated [23]. As a concrete example, the MIDAS system [1] extracts basic relations from multiple (publicly available) financial data sources, some of which are semistructured or just text, and integrates them into composite entities, events and relationships.…”

Section: Introduction (mentioning)

confidence: 99%

“…When the integration query is taken as the view definition, deletion propagation becomes the task of suggesting tuples to be deleted from the base relations for eliminating the erroneous conclusion, while minimizing the effect on the remaining conclusions. Furthermore, eliminating tuples from the base relations may itself entail deletion propagation, since these tuples are typically extracted by consulting external (possibly unclean) data sources [23,25].…”

Section: Introduction (mentioning)

confidence: 99%

Maximizing conjunctive views in deletion propagation

Kimelfeld, Vondrák, Williams (2011)

Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems

In deletion propagation, tuples from the database are deleted in order to reflect the deletion of a tuple from the view. Such an operation may result in the (often necessary) deletion of additional tuples from the view, besides the intentionally deleted one. The complexity of deletion propagation is studied, where the view is defined by a conjunctive query (CQ), and the goal is to maximize the number of tuples that remain in the view. Buneman et al. showed that for some simple CQs, this problem can be solved by a trivial algorithm. This paper identifies additional cases of CQs where the trivial algorithm succeeds, and in contrast, it proves that for some other CQs the problem is NP-hard to approximate better than some constant ratio. In fact, this paper shows that among the CQs without self joins, the hard CQs are exactly the ones that the trivial algorithm fails on. In other words, for every CQ without self joins, deletion propagation is either APX-hard or solvable by the trivial algorithm.

The paper then presents approximation algorithms for certain CQs where deletion propagation is APX-hard. Specifically, two constant-ratio (and polynomial-time) approximation algorithms are given for the class of star CQs without self joins. The first algorithm is a greedy algorithm, and the second is based on randomized rounding of a linear program. While the first algorithm is more efficient, the second one has a better approximation ratio. Furthermore, the second algorithm can be extended to a significant generalization of star CQs. Finally, the paper shows that self joins can have a major negative effect on the approximability of the problem.
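To make the objective in this abstract concrete, the following is a minimal sketch of deletion propagation on one small, projection-free join view. It is an illustration only, not the algorithm from the paper: the view definition `V(x, y, z) :- R(x, y), S(y, z)`, the function names, and the sample data are all assumptions chosen for the example. Because the view has no projection, removing a view tuple requires deleting one of the two base facts that produce it, so the sketch simply tries both and keeps the choice that leaves the most view tuples.

```python
def view(R, S):
    """Hypothetical view V(x, y, z) :- R(x, y), S(y, z), with no projection."""
    return {(x, y, z) for (x, y) in R for (y2, z) in S if y == y2}


def propagate_deletion(R, S, target):
    """Delete one base fact so that `target` leaves the view.

    Tries the R-fact and the S-fact that jointly derive `target` and
    returns the option that keeps the largest number of view tuples.
    """
    x, y, z = target
    candidates = [
        ("R", R - {(x, y)}, S),   # delete the R-fact behind the target
        ("S", R, S - {(y, z)}),   # delete the S-fact behind the target
    ]
    # Ties are broken in favor of the first candidate (the R-fact).
    name, new_R, new_S = max(candidates, key=lambda c: len(view(c[1], c[2])))
    return name, new_R, new_S, view(new_R, new_S)


# Illustrative data (assumed, not from the paper).
R = {("a", "b"), ("a2", "b")}
S = {("b", "c"), ("b", "c2"), ("b", "c3")}

name, new_R, new_S, remaining = propagate_deletion(R, S, ("a", "b", "c"))
print(name, len(remaining))
```

In this example, deleting `R("a", "b")` would also remove two other view tuples, while deleting `S("b", "c")` removes only one other, so the sketch picks the `S` fact. The side-effect tuples lost in either case are the "additional tuples" the abstract refers to.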

“…Approaches for refining rule-based information extraction programs have been recently proposed in [34,7,26]. Shen et al [34] propose an approach for refining rules by posing a series of template questions to the user, where each question asks for additional information about a specific (predefined) feature of the desired extracted data, whereas Chai et al [7] allow users to update any (incorrect) intermediate result derived by the system and proposes techniques for incorporating these updates during program execution.…”

Section: Related Work (mentioning)

confidence: 99%

“…In contrast, we develop techniques to automatically compute a (small) set of dictionary entries, therefore allowing the user to focus on a (small) set of base tuples whose removal results in highest quality improvements for the extractor. Liu et al [26] proposed a provenance-based framework for refining information extraction rules. They showed how to use provenance to compute high-level changes, a specific intermediate result whose removal from the output of an operator causes the removal of a false positive from the result, and how multiple high-level changes can be realized via a low-level change: a concrete change to the operator that removes one or more intermediate results from the output of the operator.…”

Section: Related Work (mentioning)

confidence: 99%

“…Maintaining complex rules in general is a labor intensive process requiring expertise in the system's rule language. Building high precision dictionaries, especially for domain-specific applications, provides a low-overhead option requiring little or no knowledge of the system, and helps create, maintain and update extractor rules to further improve the quality of the system [26,9,17,28].…”

Section: Introduction (mentioning)

confidence: 99%

Provenance-based dictionary refinement in information extraction

Roy, Chiticariu, Feldman, et al. (2013)

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Self Cite

Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system are dependent on dictionary entries in arbitrary complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results.

In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view-maintenance in general relational settings.
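The trade-off described in this abstract, where removing a dictionary entry can drop both correct and incorrect results, can be sketched as follows. This is a simplified illustration under stated assumptions, not the optimization studied in the paper: each labeled output is assumed to carry the set of dictionary entries it depends on, an output is assumed to disappear if any one of those entries is removed, quality is measured by F1 against the originally correct outputs, and only single-entry removals are considered. All names and data are hypothetical.

```python
def f1_after_removal(outputs, removed):
    """F1 of the extractor after removing the dictionary entries in `removed`.

    `outputs` is a list of (provenance, is_correct) pairs, where provenance
    is the set of dictionary entries the output depends on (assumption: the
    output is lost if any of these entries is removed).
    """
    total_correct = sum(1 for _, ok in outputs if ok)
    kept = [(prov, ok) for prov, ok in outputs if not (prov & removed)]
    tp = sum(1 for _, ok in kept if ok)
    fp = len(kept) - tp
    fn = total_correct - tp  # correct outputs lost as a side effect
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_single_removal(outputs):
    """Return the single dictionary entry whose removal yields the highest F1."""
    entries = set().union(*(prov for prov, _ in outputs))
    # sorted() makes tie-breaking deterministic; None if there are no entries.
    return max(sorted(entries),
               key=lambda e: f1_after_removal(outputs, {e}),
               default=None)


# Illustrative labeled outputs (assumed data): "april" is an ambiguous name.
outputs = [
    ({"april"}, False),
    ({"april"}, False),
    ({"april"}, True),
    ({"smith"}, True),
    ({"smith"}, True),
    ({"john", "smith"}, True),
]

print(best_single_removal(outputs))
```

Here removing `"april"` discards two false positives at the cost of one correct result, which raises F1 above the baseline, whereas removing `"smith"` or `"john"` would lower it.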
