Nowadays, both public and private organizations own large private text-based data repositories containing critical information. The information stored in these data silos is usually queried through index-based information retrieval systems, which return hundreds or thousands of results for a keyword query. Infoboxes can make specific information in such results much easier to find. Generating such infoboxes is a complex problem in itself, and it becomes even harder in this type of isolated environment, because the selection of entities and their attributes can be conditioned by local and very specific parameters. In this work, we propose a methodology to tackle this problem, combining classical approaches with machine learning and leveraging the resources provided by the Semantic Web. The methodology has been applied to two well-known datasets and has also been tested in a real-world scenario, showing the feasibility of our approach.
In this work, we describe the design, development, and deployment of NEREA (Named Entity Recognizer for spEcific Areas), an automatic Named Entity Recognition and Disambiguation system developed in collaboration with professional documentalists. The aim of NEREA is to keep accurate and current information about the entities mentioned in a local repository, and thereby to support the construction of appropriate infoboxes that set out the main data of these entities. It achieves high performance thanks to its use of classification resources belonging to the local database. To this end, the system performs named entity recognition and disambiguation using three types of knowledge bases: local classification resources, global databases such as DBpedia, and a catalog created by NEREA itself. The proposed method has been validated on two different datasets, and its operation has been tested in English and Spanish. The methodology is being applied in a real media environment, with promising results.
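The idea of consulting three kinds of knowledge bases can be illustrated with a minimal sketch. It is not NEREA's implementation: the lookup order (local resources first, then the system's own catalog, then a global source), the function name `resolve_entity`, and all dictionary contents are assumptions made for illustration, and the global knowledge base is an in-memory stand-in rather than a real DBpedia query.

```python
from typing import Optional

# Hypothetical stand-ins for the three knowledge-base types named in the
# abstract. All keys and identifiers are invented for this illustration.
LOCAL_RESOURCES = {"zaragoza": "local:place/zaragoza"}
OWN_CATALOG = {"nerea": "catalog:system/nerea"}
GLOBAL_KB = {"zaragoza": "global:Zaragoza", "madrid": "global:Madrid"}


def resolve_entity(mention: str) -> Optional[str]:
    """Return the identifier from the first knowledge base that knows the mention.

    The priority order is an assumption: local knowledge is preferred
    over global knowledge whenever both contain the mention.
    """
    key = mention.strip().lower()
    for knowledge_base in (LOCAL_RESOURCES, OWN_CATALOG, GLOBAL_KB):
        if key in knowledge_base:
            return knowledge_base[key]
    return None  # unknown mention: no knowledge base contains it
```

Under these assumptions, a mention found in both the local and the global source resolves to the local entry, which reflects the emphasis the abstract places on local classification resources.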