In a previous question answering study, we identified nine semantic-relationship types, including synonyms, hypernyms, word chains, and holonyms, that exist between terms in Text Retrieval Conference queries and those in their supporting sentences in the Advanced Question Answering for Intelligence (Graff, 2002) corpus. The most frequently occurring relationship type was the hypernym (e.g., Katherine Hepburn is an actress). The aim of the present work, therefore, was to develop a method for determining a person's occupation from syntactic data in a text corpus. First, in the P-System, we compared predicate-argument data involving a proper name for different occupations using Okapi's BM25 weighting algorithm. When classifying actors and using sufficiently frequent names, an accuracy of 0.955 was attained. For evaluation purposes, we also implemented a standard apposition-based classifier (A-System). This performs well, but only if a particular name happens to appear in apposition with the corresponding occupation. Last, we created a hybrid (H-System) which combines the strengths of P with those of A. Using data with a minimum of 100 predicate-argument pairs, H performed best with an overall lenient accuracy of 0.750 while A and P scored 0.615 and 0.656, respectively. We therefore conclude that a hybrid approach combining information from different sources is the best way to predict occupations.
Introduction
A question answering (QA) system takes as input a short query and returns an exact answer extracted from a document collection. As part of our participation in the annual Text Retrieval Conference (TREC) and Cross Language Evaluation Forum (CLEF) QA evaluations, we developed the Documents and Linguistic Technology (DLT) system (Sutcliffe, White, Slattery, Gabbay, & Mulcahy, 2006). When presented with a query, it applied the method established by many other first-generation QA models: it identified the Named Entity (NE) type needed to answer the question, and then used a scoring function to select an NE of this type from a set of topical documents returned by an information retrieval (IR) system. For example, the query "How long is a quarter in an NBA game?" would be answered with an instance of the length_of_time NE type. A Boolean IR system first returned a collection of documents deemed relevant to a modified form of the original query. From these, all recognizable length_of_time NEs were scored by a function, and the NE with the highest score was returned as the answer.
The development of the DLT system spawned the investigation in White and Sutcliffe (2004), where we considered the possibility of locating supporting sentences by identifying terms that are related to those in the query. We enumerated occurrences of various morphological relationship types, including direct matches, different inflections, different Parts-of-Speech (POS), and various semantic-relationship types such as synonyms, hypernyms, word chains, and holonyms, that exist between terms in 50 TREC factoid queries from 2003 and...