A stemming algorithm for the Portuguese language

Moreira, Viviane Pereira; Huyck, Christian R.

doi:10.1109/spire.2001.989755

Cited by 71 publications

(54 citation statements)

References 7 publications


“…The Web site pages are pre-processed, and the existing terms in each page are stored in a database, with its stem [4]. Some treatments, such as frequency calculation, are carried through during the site process reading.…”

Section: Methods (mentioning)

confidence: 99%

“…verbs in imperative form: "estava", "estou", "estive" to "estar". And the terms are reduced to its stem, through the application of the "stemming" algorithm adapted for the Portuguese language [4], that performs significantly better than the Portuguese Porter algorithm version [11]. The sequence of steps is: plural reduction, feminine reduction, adverb reduction, augmentative/diminutive reduction, noun suffix reduction, verb suffix reduction, vowel removal and accents removal.…”

Section: Pre-processing (mentioning)

confidence: 99%
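The step sequence named in the citation statement above can be sketched as a small rule-driven pipeline. This is a minimal illustration only: the rule tuples, the `RULES` table, and the helper names (`apply_step`, `strip_accents`, `stem`) are assumptions made for this example and are not the published rule tables. The original algorithm may also apply some steps conditionally, whereas this sketch simply runs them in the quoted order.

```python
import unicodedata

# Each rule is (suffix, minimum length left after stripping, replacement).
# Tiny hypothetical subset for illustration, NOT the published rule tables.
RULES = {
    "plural": [("ns", 1, "m"), ("ões", 3, "ão"), ("s", 2, "")],
    "feminine": [("ora", 3, "or"), ("inha", 3, "inho")],
    "adverb": [("mente", 4, "")],
    "augmentative_diminutive": [("zinho", 2, ""), ("inho", 3, "")],
    "noun_suffix": [("ção", 3, ""), ("dade", 3, ""), ("ismo", 3, "")],
    "verb_suffix": [("ando", 2, ""), ("ar", 2, ""), ("er", 2, ""), ("ir", 2, "")],
    "vowel_removal": [("a", 3, ""), ("e", 3, ""), ("o", 3, "")],
}

# Order of the steps as listed in the quoted statement.
ORDER = [
    "plural", "feminine", "adverb", "augmentative_diminutive",
    "noun_suffix", "verb_suffix", "vowel_removal",
]


def apply_step(word, rules):
    """Apply the first rule whose suffix matches and leaves a long enough stem."""
    for suffix, min_len, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


def strip_accents(word):
    """Final step: drop diacritics by removing Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(word):
    """Run every step once, in order, then remove accents."""
    word = unicodedata.normalize("NFC", word.lower())
    for step in ORDER:
        word = apply_step(word, RULES[step])
    return strip_accents(word)


print(stem("meninas"), stem("rapidamente"), stem("cantando"))
```

Keeping the rules in a data table rather than in code makes it straightforward to swap in a fuller rule set without touching the pipeline itself.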


High performance environment for knowledge discovering in Portuguese language texts in the Web

Bastos, Ebecken

2006

Data Mining VII: Data, Text and Web Mining and Their Business Applications


This paper describes the development and implementation of a practical and efficient methodology to construct a knowledge extraction environment that contemplates the search of information from Portuguese language Web sites. The application includes some text mining facilities, such as similarity and difference identification between pages and sites, content classification and document clustering. The application conception has its origin on the evaluation environment of competitive intelligence tasks over the Web. The increasing availability of information in the Web has motivated the proposal of an environment that presents the solutions in an integrated form, supplying results analysis according to the user indication.


“…The selection criterion for the texts required that they appear in the sections Brazil, Science, Culture, Economy/Business, Education, Spirituality/Religion, Sports, World/International, Politics, Health, Society, or Technology, so as to yield a collection of objects made up of elements distributed across twelve groups. The documents in this collection were subsequently subdivided into three subsets and subjected to the pre-processing operations described in [42], so that they would be represented in a structured format that can be manipulated by the clustering algorithms. The main characteristics of the text sets corresponding to the object collections used in the evaluation experiments are described in Table 3 below.…”

Section: Evaluation of the Similarity Indices (unclassified)

Avaliação da performance de índices de similaridade aplicados ao agrupamento de objetos textuais [Performance evaluation of similarity indices applied to the clustering of textual objects]

Neto, Negreiros

2017

RBCA


Abstract: The capture and storage of data in digital form have allowed organizations to accumulate an extremely large volume of information, most of it unstructured data in the form of text. In this context, cluster analysis, or unsupervised classification of objects, is one of the most frequently used information mining techniques for organizing the steadily growing volume of textual elements, arranging documents into groups of similar items on the basis of a similarity index. This study evaluates the similarity indices Euclidean distance, cosine distance, Hamming distance, extended Jaccard coefficient, and Pearson's correlation coefficient from the perspective of six cluster validation indices, and observes that, according to this analysis, cosine distance is the most appropriate similarity index for clustering textual objects converted into structured format through text mining techniques.

Keywords: Cluster analysis. Document clustering. Similarity indices.

1 Introduction. Data mining is a process of automatic knowledge discovery in large data repositories. It corresponds to a set of techniques that operate on large databases in order to identify useful patterns that would otherwise remain unknown. Data mining tasks are classified into two main categories: predictive tasks and descriptive tasks. Predictive tasks aim to predict the content of a given attribute, called the dependent or target variable, based on the values of other attributes, known as independent or explanatory variables. Descriptive tasks, in turn
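For reference, the five similarity measures named in this abstract can be written down in a few lines over plain term-weight vectors. The sketch below uses the standard textbook definitions; the abstract does not state the exact formulations used in the study, so treating "extended Jaccard" as the Tanimoto form is an assumption, and the function names are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (both non-zero)."""
    return 1.0 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def hamming_distance(u, v):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def extended_jaccard(u, v):
    """Tanimoto form: u.v / (|u|^2 + |v|^2 - u.v). Assumed definition."""
    d = dot(u, v)
    return d / (dot(u, u) + dot(v, v) - d)


def pearson_correlation(u, v):
    """Pearson's r between the components of u and v (neither constant)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    return dot(du, dv) / (math.sqrt(dot(du, du)) * math.sqrt(dot(dv, dv)))


# Two toy term-frequency vectors over the same three-term vocabulary.
doc_a = [1, 0, 2]
doc_b = [2, 0, 4]
print(cosine_distance(doc_a, doc_b), euclidean_distance(doc_a, doc_b))
```

Note that `doc_b` is a scaled copy of `doc_a`: cosine distance ignores the difference in vector length, while Euclidean distance does not. That sensitivity to document length is one commonly cited reason the two measures can rank text pairs differently.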


“…Step 1: the Orengo algorithm [Orengo & Huyck, 2001] is applied to remove suffixes and stop words from the keywords informed by the user; -Step 2: the base of aligned ontologies is consulted to extract all annotations and terms semantically related to the keywords provided by the user. In this step if Obaa mapping annotations are present, they are used to relate keywords to specific learning object metadata.…”

Section: Semantic Search Engine (mentioning)

confidence: 99%
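Step 1 of the quoted search procedure, removing stop words and suffixes from user keywords, might look roughly like the sketch below. The stop-word set, the `toy_stem` stand-in, and the `normalize_keywords` function are all hypothetical names introduced for this example; a real system would plug in a full Portuguese stemmer and stop-word list in their place.

```python
# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "o", "as", "os", "de", "da", "do", "em", "para", "e"}


def toy_stem(word):
    """Stand-in stemmer (an assumption): drops a final 's', then a final vowel."""
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    if word[-1] in "aeo" and len(word) > 3:
        word = word[:-1]
    return word


def normalize_keywords(query, stemmer=toy_stem):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    tokens = query.lower().split()
    return [stemmer(token) for token in tokens if token not in STOP_WORDS]


print(normalize_keywords("Objetos de aprendizagem para redes"))
```

Passing the stemmer in as a parameter keeps the keyword pipeline independent of any particular stemming algorithm.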