Luís Miguel Cabral scite author profile

Costa

2008

In this paper we describe SUPeRB, a digital librarian helper, which has two specific goals: to update and maintain specific publication repositories; and to assist in the publishing of publication records, for institutions and individual actors. It does this by gathering bibliographic data from web pages and documents in order to build a local repository of bibliographic data on a specific subject. By collecting information from these resources, SUPeRB also assists in building a bibliographic database of domain-specific actors such as authors, conferences and scientific journals.

Resumo. This article describes SUPeRB, a system for finding and processing bibliographic references on the Web, which has two goals: to update and maintain publication repositories in a specific area; and to assist in publishing the bibliographic data of institutions or individual researchers. To this end, SUPeRB collects information from web pages and electronic documents, building a local repository of bibliographic references for the area. In building these resources, SUPeRB also creates a knowledge base for a given domain, containing authors, conferences and scientific journals.

Categories and Subject Descriptors

General Terms: Information extraction, Information management, Bibliographic references

INTRODUCTION

Since 1999, Linguateca has provided a portal dedicated to the computational processing of Portuguese, with the aim of offering a good overview to everyone interested in this area. From the outset, our goal has been to guarantee a place where researchers and developers can follow the work done in this area, so as to avoid duplicated effort and instead foster collaboration between institutions carrying out complementary work. One of the resources we maintain is a catalogue of publications related to the computational processing of Portuguese.

Between 1999 and 2003, we manually collected about 750 entries, including, when available, their electronic versions. Although our team follows the mailing lists and the lists of accepted papers at conferences relevant to the area, we concluded that it was not easy to keep this resource up to date. It is particularly difficult to find complete information about articles and other scientific publications, since many researchers do not update their publication pages frequently. In addition, it is common to encounter other difficulties in obtaining and processing this information, such as:

• Incomplete references, which omit, for example, the full names of conferences, the volume editors, the conference editions or their location;
• Several bibliographic styles use the authors' initials, which complicates the task of identifying the authors automatically;
• The electronic versions are not exactly the same as the published versions (at least with regard to formatting).

It should also be noted that almost none of the authors with works in our catalogue uses meta-information or ...

REPENTINO – A Wide-Scope Gazetteer for Entity Recognition in Portuguese

Sarmento

Pinto

Cabral

2006

Abstract. In this paper we describe REPENTINO, a publicly available gazetteer intended to help the development of named entity recognition systems for Portuguese. REPENTINO aims to minimize the problems developers face due to the limited availability of this type of lexical-semantic resource for Portuguese. The data stored in REPENTINO was mostly extracted from corpora and from the web using simple semi-automated methods. Currently, REPENTINO stores nearly 450k instances of named entities divided into more than 100 categories and subcategories, covering a much wider set of domains than those usually included in traditional gazetteers. We present some figures regarding the current content of the gazetteer and describe future work on the evaluation of this resource and its enrichment with additional information.

GikiCLEF: Expectations and Lessons Learned

2010

Abstract. This overview paper is devoted to a critical assessment of GikiCLEF 2009, an evaluation contest specifically designed to expose and investigate cultural and linguistic issues in Wikipedia search, with eight participant systems and 17 runs. After providing a maximally short but self-contained overview of the GikiCLEF task and participation, we present the open source SIGA system, and discuss, for each of the main guiding ideas, the resulting successes or shortcomings, concluding with further work and still unanswered questions.

Motivation

One of the reasons to propose and organize GikiCLEF (and the previous GikiP pilot [1]) was our concern that CLEF did not in general propose realistic enough tasks, especially in matters dealing with crosslingual and multilingual issues, both in topic/question creation and in the setups provided. In other words, while sophisticated from many points of view, the CLEF setup was deficient in the attention paid to language differences (see e.g. ...). We all know that in IR evaluation laboratory testing has to be different from real life, and that a few topics or choices are not possible to validate a priori, but have to be studied after enough runs have been submitted and with respect to the pools and systems that were gathered¹. We nevertheless wanted to go some steps further, attempting to satisfy the following desiderata. GikiCLEF thus should:

1. provide a marriage of information needs and information source with real-life anchoring: and it is true that the man in the street does go to Wikipedia in many languages to satisfy his information needs;
2. tackle questions difficult both for a human being and for a machine: basically, we wanted a task with real usefulness, and not a task which would challenge systems to do what people don't want them to do. On the other hand, we wanted of course tasks that were possible to assess by (and satisfy) people, and not tasks that only computers could evaluate;
3. implement a context where different languages should contribute different answers, so that it would pay to look in many languages in parallel;
4. present a task that fostered the deployment of multilingual (and monolingual) systems that made use of comparable corpora.

¹ In fact, although this has been done for TREC (see [5,6]), it still remains to be done for CLIR or MLIA, although GridCLEF [7] is a significant step in this direction.

What Happened to Esfinge in 2007?

Costa

How geographic was GikiCLEF?

Cardoso

2010