Proceedings Eighth Symposium on String Processing and Information Retrieval
DOI: 10.1109/spire.2001.989755

A stemming algorithm for the Portuguese language

Abstract: Stemming algorithms are traditionally used in Information Retrieval with the goal of enhancing recall, as they conflate the variant forms of a word into a common representation. This paper describes the development of a simple and effective suffix-stripping algorithm for Portuguese. The stemmer is evaluated using a method proposed by Paice [9]. The results show that it performs significantly better than the Portuguese version of the Porter algorithm.

citations
Cited by 71 publications
(54 citation statements)
references
References 7 publications
“…The Web site pages are pre-processed, and the existing terms in each page are stored in a database, with its stem [4]. Some treatments, such as frequency calculation, are carried through during the site process reading.…”
Section: Methods (mentioning)
confidence: 99%
“…verbs in imperative form: "estava", "estou", "estive" to "estar". And the terms are reduced to its stem, through the application of the "stemming" algorithm adapted for the Portuguese language [4], that performs significantly better than the Portuguese Porter algorithm version [11]. The sequence of steps is: plural reduction, feminine reduction, adverb reduction, augmentative/diminutive reduction, noun suffix reduction, verb suffix reduction, vowel removal and accents removal.…”
Section: Pre-processing (mentioning)
confidence: 99%
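The step sequence quoted in the statement above can be sketched as a small rule pipeline. This is a minimal illustration only: the rule tables, the minimum stem lengths, and the names `STEPS`, `apply_step`, `remove_accents` and `stem` are hypothetical placeholders, not the rule lists published with the stemmer, and the sketch simply runs every step in the listed order, whereas the original algorithm may apply some steps conditionally.

```python
import unicodedata

# Each step is a list of (suffix, minimum stem length, replacement) rules.
# ASSUMPTION: these few rules are illustrative placeholders, not the
# published rule tables, which are much larger and include exceptions.
STEPS = [
    ("plural", [("ões", 3, "ão"), ("ns", 1, "m"), ("s", 2, "")]),
    ("feminine", [("inha", 3, "inho"), ("na", 3, "no"), ("ada", 3, "ado")]),
    ("adverb", [("mente", 4, "")]),
    ("augmentative/diminutive", [("inho", 3, ""), ("ão", 3, "")]),
    ("noun suffix", [("idade", 3, ""), ("ção", 3, "")]),
    ("verb suffix", [("ando", 2, ""), ("ar", 2, "")]),
    ("vowel removal", [("o", 2, ""), ("e", 2, ""), ("a", 2, "")]),
]


def apply_step(word, rules):
    """Apply the first rule whose suffix matches and whose stem is long enough."""
    for suffix, min_len, replacement in rules:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_len:
            return word[:stem_len] + replacement
    return word  # no rule fired; the word passes through unchanged


def remove_accents(word):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(word):
    """Run the steps in the order listed in the citing statement."""
    word = word.lower()
    for _name, rules in STEPS:
        word = apply_step(word, rules)
    return remove_accents(word)


print(stem("meninas"), stem("meninos"), stem("rapidamente"))
```

With these placeholder rules, "meninas" and "meninos" conflate to the same stem, which is the recall-oriented behaviour the abstract describes. Keeping the rules as data rather than code makes it straightforward to swap in a fuller rule table.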
“…The selection criterion required the texts to appear in the sections Brazil, Science, Culture, Economy/Business, Education, Spirituality/Religion, Sports, World/International, Politics, Health, Society or Technology, so as to produce a collection of objects made up of elements distributed across twelve groups. The documents in this collection were subsequently subdivided into three subsets and submitted to the pre-processing operations described in [42], so that they would be represented in a structured format that the clustering algorithms can manipulate. The main characteristics of the text sets corresponding to the object collections used in the evaluation experiments are described in Table 3 below.…”
Section: Evaluation of Similarity Indices (unclassified)
“…Step 1: the Orengo algorithm [Orengo & Huyck, 2001] is applied to remove suffixes and stop words from the keywords informed by the user; -Step 2: the base of aligned ontologies is consulted to extract all annotations and terms semantically related to the keywords provided by the user. In this step if Obaa mapping annotations are present, they are used to relate keywords to specific learning object metadata.…”
Section: Semantic Search Engine (mentioning)
confidence: 99%