Information extraction programs (extractors) can be applied to documents to isolate structured versions of some content, that is, to create tabular records corresponding to facts found in the documents. If the data in an extracted table needs to be updated for any reason (for example, as a result of data cleaning), the source document will no longer be synchronized with the data. But documents are the principal medium for sharing information among humans. We therefore wish to ensure that changes to extracted tables are reflected correctly in their source documents.

In this work, we characterize extractors for which we are able to predict the effects that updates to source documents will have on extracted records. We introduce three general properties for extractors that, if satisfied, can guarantee that consistency will be maintained if the lineage of extracted records is respected when changing the documents. We propose a property verification process that uses static analysis for a substantial subset of JAPE, a well-established rule-based extraction language, and illustrate it through an example based on a freely available extractor library.
CCS CONCEPTS
• Information systems → Information extraction; • Applied computing → Document management and text processing; • Security and privacy → Data anonymization and sanitization.