Abstract. A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically.
WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
Keywords: natural language processing, information extraction, rule learning
Information extraction
As more and more text becomes available on-line, there is a growing need for systems that extract information automatically from text data. An information extraction (IE) system can serve as a front end for high precision information retrieval or text routing, as a first step in knowledge discovery systems that look for trends in massive amounts of text data, or as input to an intelligent agent whose actions depend on understanding the content of text-based information.
IE systems have been developed for writing styles ranging from structured text with tabular information to free text such as news stories. A key element of such systems is a set of text extraction rules that identify relevant information to be extracted.
For structured text, the rules specify a fixed order of relevant information and the labels or HTML tags that delimit strings to be extracted. For free text, an IE system needs several steps in addition to text extraction rules. These include syntactic analysis, semantic tagging, recognizers for domain objects such as person and company names, and discourse processing that makes inferences across sentence boundaries. Extraction rules for free text are typically based on patterns involving syntactic relations between words or semantic classes of words.
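To make the structured-text case concrete, the following is a minimal sketch of a hand-written extraction rule of the kind described above: a fixed order of fields, each delimited by HTML tags. It is an illustration only and is not WHISK's rule representation or learning algorithm. The field names (`name`, `price`, `city`), the `<td>` delimiters, and the sample row are hypothetical and do not come from the paper.

```python
import re

# Hypothetical rule for structured text: three fields in a fixed order,
# each delimited by <td>...</td> tags. Field names and tags are assumptions
# chosen for illustration, not taken from the paper.
RULE = re.compile(
    r"<td>(?P<name>[^<]*)</td>\s*"
    r"<td>(?P<price>[^<]*)</td>\s*"
    r"<td>(?P<city>[^<]*)</td>"
)


def extract(text):
    """Return one dict of extracted fields per place the rule matches."""
    return [match.groupdict() for match in RULE.finditer(text)]


# Hypothetical input row; the rule only fires when all three fields
# appear in the expected order with the expected delimiters.
SAMPLE = "<tr><td>Studio</td><td>$675</td><td>Seattle</td></tr>"

if __name__ == "__main__":
    print(extract(SAMPLE))
```

A rule of this kind depends entirely on rigid formatting: text that omits a delimiter or reorders the fields produces no match, which is why free text requires the additional syntactic and semantic processing steps listed above.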
Semi-structured text
A useful class of text that falls between these extremes has been largely inaccessible to IE systems. Such semi-structured text 1 is ungrammatical and often telegraphic in style, but does