Bilingual dictionaries hold great potential as a source of lexical resources for training and testing automated systems for optical character recognition, machine translation, and cross-language information retrieval. In this paper, we describe a system for extracting term lexicons from printed bilingual dictionaries. Our work was divided into three phases -dictionary segmentation, entry tagging, and generation. In segmentation, pages are divided into logical entries based on structural features learned from selected examples. The extracted entries are associated with functional labels and passed to a tagging module which associates linguistic labels with each word or phrase in the entry. The output of the system is a structure that represents the entries from the dictionary. We have used this approach to parse a variety of dictionaries with both Latin and non-Latin alphabets, and demonstrate the results of term lexicon generation for retrieval from a collection of French news stories using English queries.Keywords: Cross-Language IR, OCR, Logical Analysis, Page Segmentation, Bilingual DictionariesThe support of this research under DARPA cooperative agreement N660010028910 and DOD contract MDA90402C0406 is gratefully acknowledged.
Report Documentation PageForm Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.