2000
DOI: 10.1177/016555150002600610

Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval

Abstract: At some stage, most of the models and techniques implemented in IR use frequency counts of the terms appearing in documents and in queries. However, many words, since they are derived from the same stem, have very close semantic contents. This makes a grouping of such variants under a single term advisable. Otherwise, dispersal occurs in the calculation of frequency of these terms, and it also becomes difficult to compare queries and documents. On the other hand, there are notable differences between d…
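The dispersal of frequency counts described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the suffix list, the `toy_stem` and `dice` helpers, and the sample tokens are hypothetical and are not the stemmer or the n-gram procedure evaluated in the paper. The sketch only shows how grouping variants under one stem concentrates counts, and how shared character n-grams can relate variants without any stemming rules.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix list (an assumption for this sketch;
# not the stemmer evaluated in the paper).
SUFFIXES = ("ar", "as", "os", "es", "a", "o", "s")


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def char_ngrams(word, n=3):
    """Return the set of overlapping character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient over the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


tokens = ["niño", "niños", "niña", "niñas", "libro", "libros"]

# Without grouping, every surface form is counted separately.
raw_counts = Counter(tokens)

# With grouping, variants collapse onto a shared stem and counts concentrate.
stem_counts = Counter(toy_stem(t) for t in tokens)

print(raw_counts)
print(stem_counts)
print(dice("libro", "libros"), dice("libro", "niño"))
```

In this toy data each surface form occurs once, while the grouped counts collect all four "niñ-" variants under one term. The n-gram similarity takes a different route: it needs no suffix list, only a threshold on shared trigrams, which is one reason n-grams are often considered for morphologically rich languages.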

Citations: cited by 16 publications (11 citation statements)
References: references 18 publications
“…It is difficult to know if this improvement is due to a more accurate extraction of pairs or due to differences between Spanish and English constructions. An important characteristic of the CLEF collection that can have a considerable impact on the performance of linguistically motivated indexing techniques is the large number of typographical errors present in documents, as have been reported in [8]. In particular, titles of the news (documents) are in capital letters without accents.…”
Section: Results (mentioning)
confidence: 99%
“…In [21] and [22] the reader can find a very thorough state of the art on the subject. Examples of this type of analysis include graph comparison [20], the use of n-grams [16,34], analogy search [32], rule-based surface models [38,31], probabilistic models [12], segmentation by optimization [11,19], unsupervised learning of morphological families by agglomerative hierarchical clustering [7], lemmatization using Levenshtein distances [14], and suffix identification by means of entropy [42]. These methods differ in the type of results obtained: the identification of lemmas, stems, or suffixes.…”
Section: Stemming and Lemmatization Algorithms (unclassified)
“…We have also built a Lemmatizer to apply to documents from the collection object of our study, based on the works of Porter and other more specific to the Spanish language [10].…”
Section: Stemming (mentioning)
confidence: 99%