Named Entity Recognition (NER) is an important task in many NLP pipelines. It has become especially important for the knowledge bases that power many of today's information retrieval systems. To cope with the high demand for annotated training corpora for supervised NER systems, automatic generation approaches have been proposed. In this paper we report on the first automatically generated NE-annotated corpus for Albanian. News articles from Albanian news media were used as the document source. They were automatically tagged using a custom gazetteer generated from the Albanian Wikipedia. Our evaluation results show that this corpus can be used as a baseline corpus for human-annotated ones or as a training corpus where no other is available.
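As a rough illustration of gazetteer-based tagging, the following minimal sketch matches known entity names against raw text. It is not the pipeline used for the corpus: the `GAZETTEER` entries, the label set and the `tag_entities` helper are assumptions made for this example, and exact string matching ignores Albanian inflection, which a real system would need to handle.

```python
import re

# Hypothetical gazetteer: surface form -> entity type (illustrative entries only).
GAZETTEER = {
    "Tirana": "LOC",
    "Ismail Kadare": "PER",
    "Universiteti i Tiranës": "ORG",
}


def tag_entities(text, gazetteer):
    """Return (start, end, surface, label) tuples for non-overlapping matches."""
    # Longer entries come first so multi-word names win over shorter ones.
    entries = sorted(gazetteer, key=len, reverse=True)
    alternatives = "|".join(re.escape(entry) for entry in entries)
    # Lookarounds keep a match from starting or ending inside a longer word.
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")
    return [
        (m.start(), m.end(), m.group(1), gazetteer[m.group(1)])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for span in tag_entities("Ismail Kadare vizitoi Tirana.", GAZETTEER):
        print(span)
```

The character offsets returned by `tag_entities` could then be converted to whatever annotation format a downstream NER trainer expects.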
News retrieval systems make it easier to learn quickly about events or stories reported by various online news providers. The traditional approach involves clustering articles that report on the same event using bag-of-words or concept-based similarity measures, and offering personalized recommendations using various user modeling approaches. Knowledge bases have been used extensively in recent years to power entity-based searches in search engines. The success of this approach, demonstrated by the now de facto way of searching and browsing offered by commercial search engines and mobile applications, has created the need to incorporate semantic capabilities into news retrieval systems. In this paper we present a proposal for creating a knowledge base of entities, events and facts reported by Albanian online news providers. We aim to provide a news stream processing pipeline based on generally available open source toolkits and state-of-the-art research on event- and fact-oriented knowledge bases.
In recent years there has been an increase in scientific paper publications in Albania and in neighboring countries that have large communities of Albanian-speaking researchers. Many of these papers are written in Albanian. Finding papers related to a researcher's work is very time-consuming, because there is no dedicated system that facilitates this process. In this paper we present the design of a modular intelligent search system for articles written in Albanian. Its main part is the recommender module, which facilitates searching by providing users with articles relevant to a given one. We used a cosine-similarity-based heuristic that weights term frequencies according to their location in the article. We did not notice large differences in the recommendation results when using different combinations of the importance factors of the keywords, title, abstract and body. We got similar results when using only the title and abstract as with the other combinations. Because we got fairly good results with this initial approach, we believe that similar recommender systems for documents written in Albanian can also be built in contexts not related to scientific publishing.
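The location-weighted similarity idea can be sketched as follows. This is only an illustration under stated assumptions: the `WEIGHTS` values, the whitespace tokenization and the helper names `weighted_vector` and `cosine` are invented for this example and are not the importance factors or implementation reported in the paper.

```python
import math
from collections import Counter

# Hypothetical importance factors per article section (not values from the paper).
WEIGHTS = {"keywords": 3.0, "title": 2.0, "abstract": 1.5, "body": 1.0}


def weighted_vector(article, weights):
    """Build a term vector where each occurrence counts by its section's weight."""
    vec = Counter()
    for section, weight in weights.items():
        # Naive lowercase whitespace tokenization; a real system would normalize more.
        for term in article.get(section, "").lower().split():
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = {"title": "sistem rekomandimi", "abstract": "artikuj shkencor"}
    second = {"title": "sistem kerkimi", "abstract": "artikuj ne shqip"}
    score = cosine(weighted_vector(first, WEIGHTS), weighted_vector(second, WEIGHTS))
    print(round(score, 3))
```

Restricting `WEIGHTS` to the title and abstract entries gives the reduced configuration that the abstract reports as performing similarly to the other combinations.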