A Stemming Algorithm for Latin Text Databases

Schinke, Robyn; Greengrass, Mark; Robertson, Anne M.; Willett, Peter

doi:10.1108/eb026966

Cited by 28 publications

(17 citation statements)

References 14 publications


“…On the one hand, there has been generally less IR work done in these languages and on the other hand, the application of stemming algorithms requires the implementation of considerable linguistic knowledge, which is not always available. In any case, it is possible to find proposals and algorithms for specific languages, among which are Latin itself, despite its being a dead language [7], Malay [8], French [9], [10] or Arabic [11].…”

mentioning

confidence: 99%

Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval

Figuerola

Román

2000

Journal of Information Science


Abstract: At some stage, most of the models and techniques implemented in IR use frequency counts of the terms appearing in documents and in queries. However, many words, since they are derived from the same stem, have very close semantic contents. This makes a grouping of such variants under a single term advisable. Otherwise, dispersal occurs in the calculation of frequency of these terms, and it also becomes difficult to compare queries and documents. On the other hand, there are notable differences between different languages in the way of forming derivatives and inflected forms, so that the application of specific techniques can produce unequal results according to the language of the documents and queries. A description is given of the tests carried out for documents in Spanish, which involved some stemming techniques widely used in English, as well as the application of n-grams, and the results are compared.



“…We have also implemented a manual approach in read_ontology that inputs a user‐defined ontology as a text file. To link characters to terms in the ontology, we first simplify terms using the Schinke algorithm (Schinke, Greengrass, Robertson, & Willett, ), useful for Latin terms common in anatomical datasets (e.g. ‘humerus’ becomes ‘humer’).…”

Section: Overview Of the Phenotools Package

mentioning

confidence: 99%

phenotools: An r package for visualizing and analysing phenomic datasets

2019


Phenotypic data are crucial for understanding genotype–phenotype relationships, assessing the tree of life and revealing trends in trait diversity over time. Large‐scale description of whole organisms for quantitative analyses (phenomics) presents several challenges, and technological advances in the collection of genomic data outpace those for phenomic data. Reasons for this disparity include the time‐consuming and expensive nature of collecting discrete phenotypic data and mining previously published data on a given species (both often requiring anatomical expertise across taxa), and computational challenges involved with analysing high‐dimensional datasets. One approach to building approximations of organismal phenomes is to combine published datasets of discrete characters assembled for phylogenetic analyses into a phenomic dataset. Despite a wealth of legacy datasets in the literature for many groups, relatively few methods exist for automating the assembly, analysis, and visualization of phenomic datasets in phylogenetic contexts. Here, we introduce a new r package phenotools for integrating (fusing original or legacy datasets), curating (finding and removing duplicates) and visualizing phenomic datasets. We demonstrate the utility of the proposed toolkit with a morphological dataset for flightless birds and two morphological datasets for theropod dinosaurs and provide recommendations for character construction to maximize accessibility in future workflows. Visualization tools allow rapid identification of anatomical subregions with difficult or problematic histories of homology. We anticipate these tools aiding automation of the assembly and visualization of phenomic datasets to inform evolutionary relationships and rates of phenotypic evolution.
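The term simplification described in the citation statement above (‘humerus’ becomes ‘humer’) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the phenotools code (which is an R package) and not the full published Schinke algorithm: the suffix list is a small illustrative subset chosen here, the j/i and v/u normalisation and the minimum stem length of two are assumed, and the names `simplify_term` and `link_terms` are hypothetical.

```python
# Illustrative subset of Latin noun endings. This list is an assumption made
# for the sketch; it is not the complete suffix list of the published algorithm.
NOUN_SUFFIXES = ["ibus", "ius", "ae", "am", "as", "em", "es", "is",
                 "os", "um", "us", "a", "e", "i", "o", "u"]


def simplify_term(term: str) -> str:
    """Reduce a single word to a crude stem by stripping one noun ending."""
    # Assumed normalisation: treat j as i and v as u before matching.
    word = term.lower().replace("j", "i").replace("v", "u")
    # Try longer endings first so "ibus" wins over "us".
    for suffix in sorted(NOUN_SUFFIXES, key=len, reverse=True):
        # Keep at least two characters of stem (assumed threshold).
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)]
    return word


def link_terms(character: str, ontology_terms: list[str]) -> list[str]:
    """Return ontology terms whose stem matches any word in a character description."""
    stems = {simplify_term(w) for w in character.split()}
    return [t for t in ontology_terms if simplify_term(t) in stems]


print(simplify_term("humerus"))
print(link_terms("shape of humeri", ["humerus", "femur"]))
```

Matching on stems rather than surface forms lets a character that mentions "humeri" link to an ontology entry spelled "humerus", which is the motivation given in the quoted passage.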


“…Even though the Longest-match approach requires the compilation of all possible combinations of suffixes; it has less computational complexity because the arrangement of suffixes in suffix list are in their decreasing order of length and has less time complexity because it involves in single pass of the suffix match. In addition, longest-match approach is often easier to program [5,10].…”

Section: Stemming Approach

mentioning

confidence: 99%

Development of Longest-Match Based Stemmer for Texts of Wolaita Language

Bade

2018

IJDST


This research presents design, experiment and development of longest-match based Stemmer for Wolaita texts. The objective of this paper is to conflate the variants of Wolaita text words into its stem with better accuracy, using Longest-Match based approach. To help the researcher how to compile the possible combination of suffixes, the deep analysis of Wolaita word morphology has been made. For data preprocess and implementation, C# programming language is used. After preprocessing, 12789 unique words are reserved to experiment this research. Out of these unique words, 1200 words are randomly selected earlier and kept separate for testing purpose. Then the developed stemmer was tested using Paice's actual error counting method. The output on that test dataset has showed 91.84% accuracy over actual manually stemmed words. The obtained result shows that the rule based longest match approach is promising for stemming Wolaita language texts.
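The longest-match idea quoted in this entry (a suffix list that includes combined suffixes, ordered by decreasing length, and matched in a single pass) can be sketched as below. This is a generic Python illustration, not the cited C# stemmer: the function names, the minimum stem length, and the English-like placeholder suffixes are all assumptions, since the actual Wolaita suffix list is not given here.

```python
def build_suffix_table(suffixes):
    """Order suffixes longest first so the first hit is the longest match."""
    return sorted(set(suffixes), key=len, reverse=True)


def longest_match_stem(word, suffix_table, min_stem_len=2):
    """Strip the longest matching suffix in one pass over the ordered table."""
    for suffix in suffix_table:
        # min_stem_len is an assumed guard against stripping a word to nothing.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word  # no suffix matched


# Placeholder suffixes for illustration only. Combined forms such as
# "lessness" must be listed explicitly, which is the compilation cost
# the quoted passage mentions.
TABLE = build_suffix_table(["s", "ness", "less", "lessness", "nesses"])

print(longest_match_stem("hopelessness", TABLE))
print(longest_match_stem("kindness", TABLE))
```

Because "lessness" is in the table, "hopelessness" is reduced in one lookup; a stemmer that only knew "less" and "ness" separately would need repeated passes to reach the same stem.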
