* Endorsed by SIGNLL, ACL's Special Interest Group on Natural Language Learning
* Endorsed by SIGANN, ACL's Special Interest Group for Annotation
* Extended versions of the best papers will be chosen for a special issue of the Decision Support Systems journal (published by Elsevier).
We present a working system for automated news analysis that ingests an average of 7600 news articles per day in five languages. For each language, the system detects the major news stories of the day using group-average unsupervised agglomerative clustering. For each cluster, it also tracks related groups of articles published over the previous seven days, using cosine similarity over weighted terms. The system furthermore tracks related news across languages, in all language pairs involved. The cross-lingual news cluster similarity is a linear combination of three types of input: (a) cognates, (b) automatically detected references to geographical place names and (c) the results of a mapping process onto a multilingual classification system. A manual evaluation showed that the system produces good results.
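The monolingual clustering step described above can be illustrated with a minimal sketch. It is not the system's implementation: the stopping threshold, the toy term weights and the function names (`cosine`, `group_average`, `cluster`) are all assumptions made for illustration, and "group average" is taken here to mean the mean pairwise cosine similarity between the members of two clusters.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_average(c1, c2, vectors):
    """Mean pairwise cosine similarity between members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(vectors, threshold=0.5):
    """Bottom-up clustering: merge the most similar pair of clusters
    until no pair reaches the (assumed) similarity threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sim = group_average(clusters[x], clusters[y], vectors)
            if sim > best_sim:
                best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair  # x < y, so deleting y leaves index x valid
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy weighted-term vectors (invented for illustration only).
docs = [
    {"election": 2.0, "vote": 1.0},
    {"election": 1.5, "vote": 1.2, "poll": 0.3},
    {"earthquake": 2.0, "rescue": 1.0},
    {"earthquake": 1.0, "rescue": 1.5},
]
result = cluster(docs, threshold=0.5)
```

With these toy vectors, the two election articles end up in one cluster and the two earthquake articles in another, because the cross-topic similarity is zero and falls below the threshold.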
Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights from various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines (most of all, extreme simplicity) can be very restrictive and limiting, we believe we have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications (http://press.jrc.it/overview.html). EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in between twenty and fifty languages. We will also touch upon the kind of language resources that would make it easier for everyone to develop highly multilingual text mining applications. We will argue that, to achieve this, the most needed resources would be freely available, simple, parallel and uniform multilingual dictionaries, corpora and software tools.
The share of non-English documents on the internet is rising continuously. While many private users will only be interested in finding monolingual information in their own language, the need for multilingual information retrieval, information extraction and cross-lingual information access for professionals, organisations and businesses is rising steadily.
Starting from the premise that we need multilingual text mining tools, the question we would like to ask here is: How can we avoid a situation in which developing a text mining application for N languages takes N times the effort of developing it for one language? It is generally acknowledged that developers benefit from the experience of having produced tools in one or more languages before, and that the existence of an efficient implementation infrastructure is extremely important (e.g. Maynard et al. 2002). Such software building blocks can include, for instance, a grammar implementation formalism, tools for marking up text, debugging tools, automatic evaluation tools and procedures, etc. Furthermore, simple applications like sentence splitters are typically so similar for different languages that, once one exists, the same tool is usually quickly adapted to new languages. We will thus try to take the effort of developing the infrastructure out of the equation. The question should thus be reformulated: Assuming that you have already developed text mining applications for some languages, ho...
We present a tool that extracts person names from multilingual news collections and matches name variants referring to the same person. A novel feature is the matching of name variants across languages and writing systems, including names written in the Greek, Cyrillic and Arabic writing systems. Because of our highly multilingual setting, we use an internal standard representation for representing and matching names, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 15,000 news articles per day to detect related news within the same and across different languages.
After giving some background on name transliteration and referring to related work (Section Background and related work), we describe tools to identify names in text (Section Proper name recognition) and the mechanism to merge name variants, including those written in Cyrillic, Arabic, and Greek script (Section Detecting and merging name variants). This is followed by evaluation results (Section Evaluation).
Table 1: Overview of a recognised person name in nine languages, showing various orthographies for the same person. The words in italics show the recognised trigger word(s).
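The idea of matching name variants through a shared internal representation, rather than through pairwise bilingual transliteration, can be sketched as follows. This is only an illustration under stated assumptions and not the tool's actual rules: the tiny Cyrillic table, the normalisation steps (lowercasing, stripping diacritics, collapsing doubled letters), the 0.8 threshold and the function names `to_internal` and `same_person` are all hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

# Toy Cyrillic-to-Latin table (assumption: covers only the example below).
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "д": "d", "и": "i", "л": "l", "м": "m",
    "н": "n", "п": "p", "р": "r", "т": "t", "у": "u",
}

def to_internal(name):
    """Map a name to a simplified, script-independent Latin form."""
    # 1. Lowercase and transliterate character by character.
    text = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name.lower())
    # 2. Strip diacritics by decomposing and dropping combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 3. Collapse doubled letters (e.g. "ll" becomes "l").
    collapsed = []
    for ch in text:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)

def same_person(a, b, threshold=0.8):
    """Treat two names as variants if their internal forms are similar enough."""
    ratio = SequenceMatcher(None, to_internal(a), to_internal(b)).ratio()
    return ratio >= threshold

internal = to_internal("Владимир Путин")
match = same_person("Владимир Путин", "Wladimir Putin")
mismatch = same_person("Angela Merkel", "Vladimir Putin")
```

Because every variant is mapped to the same internal form before comparison, adding a new script only requires one new mapping into that form, not one mapping per language pair.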