The growing availability of on-line textual sources and the potential number of applications of knowledge acquisition from textual data have led to an increase in Information Extraction (IE) research. Examples of these applications include the generation of databases from documents and the acquisition of knowledge useful for emerging technologies such as question answering, information integration, and others related to text mining. However, one of the main drawbacks of applying IE is its intrinsic domain dependence. To reduce the high cost of manually adapting IE applications to new domains, the research community has experimented with different Machine Learning (ML) techniques. This survey describes and compares the main approaches to IE and the different ML techniques used to achieve Adaptive IE technology.
Curricula designed in the context of the European Higher Education Area need to be based on both domain-specific and professional competencies. Whereas universities have extensive experience in developing students' domain-specific competencies, fostering professional competencies poses a new challenge. This paper presents a model to globally develop professional competencies in a STEM degree program, and assesses the results of its implementation after four years. The model is based on the use of competency maps, in which each competency is defined in terms of competency units. Each competency unit is described by its expected learning outcomes at three domain levels. This model allows careful analysis, revision and iteration for an effective integration of professional competencies in domain-specific subjects. A global competency map is also designed, including all the professional-competency learning outcomes to be achieved throughout the degree. This map becomes a useful tool for curriculum designers and coordinators. The results were obtained from four sources: 1) students' grades (classes graduating from 2013 to 2016, the first four years of the new Bachelor's Degree in Informatics Engineering at the Barcelona School of Informatics); 2) students' surveys (answered by students when they finished the degree); 3) the government employment survey, in which former students rate their satisfaction with the training received in the light of their work experience; and 4) the Everis Foundation University-Enterprise Ranking, answered by over 2000 employers rating their satisfaction with their employees' university training, in which the Barcelona School of Informatics ranks first nationally. The results show that competency maps are a good tool for developing professional competencies in a STEM degree.
We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than either of its individual components.
Categories and Subject Descriptors: H.3.3: Clustering
General Terms: Algorithms
Keywords: Unsupervised clustering, EM initialization
MOTIVATION AND BACKGROUND
The work presented in this paper is motivated by research into text mining and classification from large, real-world document collections. As the amount of available data becomes virtually unlimited, manual or supervised mining approaches become prohibitively expensive due to the limited reading and processing speed of human experts. For this reason, we concentrate our research only on unsupervised methods. From the larger field of text mining and classification, this paper focuses on document clustering. Clustering, loosely defined as the grouping of similar data items, is the keystone of data classification. Following our creed, we focus on unsupervised clustering techniques that do not require labeled data or human feedback. From the vast array of clustering methods, iterative refinement clustering techniques are extremely popular due to their good performance, relative simplicity, and good theoretical foundations.
By far the most popular iterative refinement clustering algorithm is Expectation Maximization (EM) [2]. EM iteratively: (a) assigns membership probabilities for all data items and all clusters, and (b) re-estimates its model parameters based on the new assignments. The EM class of clustering algorithms is not problem-free. Like all clustering algorithms, these algorithms rely on outside sources to provide the expected number of clusters, k. Having a human domain expert provide this information is not feasible when dealing with large document collections containing new, potentially unknown data. Hence, we focus only on automated, unsupervised methods for the estimation of k. The most popular probabilistic method to determine the dimensions of a given model is the Bayesian Information Criterion (BIC) [9]. From all possible mode...
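The two alternating EM steps described above can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian mixture purely for illustration; the abstract does not state the probabilistic model used for documents, and the function name `em_gmm_1d`, its parameters, and the toy data are hypothetical. The initial means are supplied by the caller, which is where an initialization procedure would plug in.

```python
import math


def em_gmm_1d(data, init_means, n_iter=50):
    """Illustrative EM for a 1-D Gaussian mixture (hypothetical helper).

    The number of clusters k is taken from len(init_means); EM itself
    does not estimate it.
    """
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k          # assumed starting variances
    weights = [1.0 / k] * k        # assumed uniform starting weights
    resp = []
    for _ in range(n_iter):
        # (a) E-step: membership probability of every item in every cluster.
        resp = []
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300   # guard against numerical underflow
            resp.append([d / total for d in dens])
        # (b) M-step: re-estimate parameters from the new soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue                  # leave an empty cluster unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate cluster
    return weights, means, variances, resp


if __name__ == "__main__":
    toy = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
    weights, means, variances, _ = em_gmm_1d(toy, init_means=[0.5, 4.5])
    print(weights, means, variances)
```

Note that both k and the starting means are inputs here, which mirrors the dependence on outside information that the text identifies as the weakness of EM-style clustering.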
This paper describes GeoTALP-IR, a Geographical Information Retrieval (GIR) system. The system is described and evaluated in the context of our participation in the CLEF 2005 GeoCLEF Monolingual English task. The GIR system is based on Lucene and uses a modified version of the Passage Retrieval module of the TALP Question Answering (QA) system presented at the CLEF 2004 and TREC 2004 QA evaluation tasks. We designed a Keyword Selection algorithm based on a Linguistic and Geographical Analysis of the topics. A Geographical Thesaurus (GT) has been built using a set of publicly available Geographical Gazetteers and a Geographical Ontology. Our experiments show that the use of a Geographical Thesaurus for Geographical Indexing and Retrieval improves the performance of our GIR system.