Provenance-based dictionary refinement in information extraction

Roy, Sudeepa; Chiticariu, Laura; Feldman, Vitaly; Reiss, Frederick; Zhu, Huaiyu

doi:10.1145/2463676.2465284

Cited by 8 publications

(1 citation statement)

References 30 publications

“…Provenance-based techniques have also been applied to information extraction problems. Roy et al [15] propose a provenance-based technique to improve the quality of extraction by refining the dictionaries that are used in a rule-based extraction system. A set of entries from the dictionaries that have been involved in generating the output are analyzed to determine which should be removed to improve the extractor's performance most.…”

Section: Related Work (mentioning)

confidence: 99%

Predictable and Consistent Information Extraction

Kassaie; Tompa

2019

Proceedings of the ACM Symposium on Document Engineering 2019

Information extraction programs (extractors) can be applied to documents to isolate structured versions of some content, that is, to create tabular records corresponding to facts found in the documents. If the data in an extracted table needs to be updated for any reason (for example, as a result of data cleaning), the source document will no longer be synchronized with the data. But documents are the principal medium for sharing information among humans. We therefore wish to ensure that changes to extracted tables are reflected correctly in their source documents. In this work, we characterize extractors for which we are able to predict the effects that updates to source documents will have on extracted records. We introduce three general properties for extractors that, if satisfied, can guarantee that consistency will be maintained if the lineage of extracted records is respected when changing the documents. We propose a property verification process that uses static analysis for a substantial subset of JAPE, a well-established rule-based extraction language, and illustrate it through an example based on a freely-available extractor library.

CCS CONCEPTS • Information systems → Information extraction; • Applied computing → Document management and text processing; • Security and privacy → Data anonymization and sanitization.
