Valeriana G. Roncero scite author profile

Ebecken

2009

The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Text mining is the process of extracting interesting information and knowledge from unstructured text. One key difficulty with text classification learning algorithms is that they require many hand-labeled documents to learn accurately. In the text mining pattern discovery phase, the text classification step aims to automatically attribute one or more pre-defined classes to text documents. In this research, we propose to use an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naïve Bayes classifier on a grid environment. This combination is based on a mixture of multinomials, which is commonly used in text classification. Naïve Bayes is a probabilistic approach to inductive learning. It estimates the a posteriori probability that a document belongs to a class given the observed feature values of the document, assuming independence of the features. The class with the maximum a posteriori probability is assigned to the document. EM is a class of iterative algorithms for maximum likelihood or maximum a posteriori estimation in problems with unlabeled data. The grid environment is a geographically distributed computation infrastructure composed of a set of heterogeneous resources. Text classification mining methods are time-consuming, but using the grid infrastructure can bring significant benefits in the learning and classification process.
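The combination of EM and a multinomial naïve Bayes classifier described in this abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the function names (`fit_nb`, `posterior`, `em_naive_bayes`, `predict`), the Laplace smoothing constant `alpha`, and the fixed iteration count `n_iter` are assumptions made for illustration, and the grid distribution of the workload is not shown.

```python
import numpy as np

def fit_nb(X, R, alpha=1.0):
    """M-step: estimate log priors and log word probabilities.

    X is a (docs x vocab) word-count matrix; R is a (docs x classes)
    matrix of class memberships (hard 0/1 rows or soft probabilities).
    alpha is an assumed Laplace smoothing constant.
    """
    class_weight = R.sum(axis=0)  # expected number of documents per class
    log_prior = np.log((class_weight + alpha)
                       / (class_weight.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha  # (classes x vocab), smoothed
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posterior(X, log_prior, log_word):
    """E-step: P(class | document) under the feature-independence assumption."""
    log_joint = X @ log_word.T + log_prior             # (docs x classes)
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Learn from labeled and unlabeled documents by alternating E and M steps."""
    R_lab = np.eye(n_classes)[y_lab]                   # labeled docs keep their labels
    log_prior, log_word = fit_nb(X_lab, R_lab)         # initialise from labeled data only
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = posterior(X_unl, log_prior, log_word)  # E-step on unlabeled docs
        log_prior, log_word = fit_nb(X_all, np.vstack([R_lab, R_unl]))  # M-step
    return log_prior, log_word

def predict(X, log_prior, log_word):
    """Assign each document the class with maximum a posteriori probability."""
    return posterior(X, log_prior, log_word).argmax(axis=1)
```

In this sketch the labeled documents keep their given classes throughout, while the unlabeled documents contribute fractional counts weighted by their current posterior probabilities. A real run would normally stop on convergence of the likelihood rather than after a fixed number of iterations.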

Text Mining Grid Services for Multiple Environments

Serpa

et al. 2008

Abstract. The objective of this paper is to describe the implementation of text mining grid services for the Aîuri Project, a framework that includes a friendly user interface, data and text mining tasks, database access and a visualization tool integrated with various grid environments. The focus is the development and testing of components for the analysis and evaluation of unstructured data in distinct grid environments. These components will be grid services for text mining processes that use several execution approaches, depending on which grid environment the user chooses to submit to. All components are open source and are freely available to the scientific community, providing access to existing services as well as encouraging the addition of new ones.

Using Stemming Algorithms on a Grid Environment

Ebecken

2008

Text Classification on a Grid Environment

Ebecken

2011