This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically ordered streams of news stories, and to track events of interest over time. We extended existing supervised learning and unsupervised clustering algorithms to allow document classification based on both the information content and the temporal aspects of events. A task-oriented evaluation was conducted using Reuters and CNN news stories. We found agglomerative document clustering highly effective (82% in the F1 measure) for retrospective event detection, and single-pass clustering with time windowing a better choice for on-line alerting of novel events. We also observed robust learning behavior for k-nearest neighbor (kNN) classification and a decision-tree approach in event tracking, under the difficult condition in which the number of positive training examples is extremely small.
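The idea of single-pass clustering with a time window can be illustrated with a minimal sketch. It is not the authors' implementation: the raw term-frequency vectors, the cosine similarity, the summed-count centroid, and the names `single_pass_cluster`, `threshold`, and `window` are all assumptions chosen for illustration, since the abstract does not state the weighting scheme or parameter values.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(stories, threshold=0.5, window=3):
    """Cluster chronologically ordered stories in one pass.

    stories   -- list of token lists, oldest first
    threshold -- hypothetical similarity cutoff for joining a cluster
    window    -- hypothetical time window, counted in stories: a cluster
                 is a candidate only if it was updated at most `window`
                 stories ago
    Returns (clusters, novel), where novel[i] is True if story i
    started a new cluster, i.e. would be flagged as a novel event.
    """
    clusters = []
    novel = []
    for i, tokens in enumerate(stories):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            if i - c["last"] > window:
                continue  # cluster is outside the time window
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # centroid as summed term counts
            best["last"] = i
            best["members"].append(i)
            novel.append(False)
        else:
            clusters.append({"centroid": Counter(vec), "last": i, "members": [i]})
            novel.append(True)
    return clusters, novel
```

With a wide window, a late story about an earlier event joins the old cluster; with a narrow window, the same story is flagged as novel, which is the trade-off that time windowing controls.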
The Pangloss Example-Based Machine Translation engine (PanEBMT) is a translation system requiring essentially no knowledge of the structure of a language, merely a large parallel corpus of example sentences and a bilingual dictionary. Input texts are segmented into sequences of words occurring in the corpus, for which translations are determined by subsentential alignment of the sentence pairs containing those sequences. These partial translations are then combined with the results of other translation engines to form the final translation produced by the Pangloss system. In an internal evaluation, PanEBMT achieved 70.2% coverage of unrestricted Spanish news-wire text, despite a simplistic subsentential alignment algorithm, a suboptimal dictionary, and a corpus from a different domain than the evaluation texts.
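The segmentation step described above, finding word sequences of the input that occur in the corpus, might look roughly like the following sketch. It covers only the matching step, not the subsentential alignment or the combination with other engines. The brute-force search, the `min_len` parameter, and the function names are assumptions for illustration and are not taken from PanEBMT.

```python
def contains(sentence, phrase):
    """True if `phrase` occurs as a contiguous word sequence in `sentence`."""
    n = len(phrase)
    return any(sentence[i:i + n] == phrase for i in range(len(sentence) - n + 1))


def find_chunks(input_words, corpus, min_len=2):
    """For each start position, find the longest input word sequence
    (at least `min_len` words, a hypothetical cutoff) that appears in
    some source-language corpus sentence.

    Returns a list of (start, end, sentence_ids), where
    input_words[start:end] is the matched sequence and sentence_ids
    lists the corpus sentences containing it.
    """
    chunks = []
    for start in range(len(input_words)):
        # Try the longest candidate first and shrink until one matches.
        for end in range(len(input_words), start + min_len - 1, -1):
            phrase = input_words[start:end]
            ids = [k for k, s in enumerate(corpus) if contains(s, phrase)]
            if ids:
                chunks.append((start, end, ids))
                break
    return chunks
```

Each returned chunk points at the example sentence pairs from which a partial translation would then be extracted by alignment. A practical system would use an index over the corpus rather than this linear scan.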
Previous work has shown that adding generalization of the examples in the corpus of an example-based machine translation (EBMT) system can reduce the required amount of pre