Stemming and its effects on TFIDF ranking (poster session)

Kantrowitz, Mark; Mohit, Behrang; Mittal, Vibhu O.

doi:10.1145/345508.345650

Cited by 40 publications

(28 citation statements)

References 8 publications

“…Stemming removes inflections (e.g., "scrolls" and "scrolling" both reduce to "scroll"). Stemming allows for a more precise comparison between bug reports by creating a more normalized corpus; our experiments used the common Porter stemming algorithm (e.g., [7]). …”

Section: Textual Analysis (mentioning)

confidence: 99%

Automated duplicate detection for bug tracking systems

Jalbert

Weimer

2008

2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN)

Bug tracking systems are important tools that guide the maintenance activities of software developers. The utility of these systems is hampered by an excessive number of duplicate bug reports; in some projects as many as a quarter of all reports are duplicates. Developers must manually identify duplicate bug reports, but this identification process is time-consuming and exacerbates the already high cost of software maintenance. We propose a system that automatically classifies duplicate bug reports as they arrive to save developer time. This system uses surface features, textual semantics, and graph clustering to predict duplicate status. Using a dataset of 29,000 bug reports from the Mozilla project, we perform experiments that include a simulation of a real-time bug reporting environment. Our system is able to reduce development cost by filtering out 8% of duplicate bug reports while allowing at least one report for each real defect to reach developers.

“…Since the storage efficiency is not a concern for our experiments and there is no available stopword list constructed for Ottoman language, a stopword list is not used in our framework. Stemming is another method that not only shrinks the vocabulary of the dataset, but may also increase the effectiveness of an IR environment depending on design factors such as the stemming algorithm and the language used [15]. For highly inflected languages, such as Arabic and Ottoman, developing effective stemmers is a hard task and not within the scope of this thesis.…”

Section: Typical Components of an IR System (mentioning)

confidence: 99%

Integrated segmentation and recognition of connected Ottoman script

et al. 2009

In this thesis, a novel context-sensitive segmentation and recognition method for connected letters in Ottoman script is proposed. This method first extracts a set of possible segments from a connected script and determines the candidate letters to which extracted segments are most similar. Next, a function is defined for scoring each different syntactically correct sequence of these candidate letters. To find the candidate letter sequence that maximizes the score function, a directed acyclic graph is constructed. The letters are finally recognized by computing the longest path in this graph. Experiments using a collection of printed Ottoman documents reveal that the proposed method provides very high precision and recall figures in terms of character recognition. In a further set of experiments we also demonstrate that the framework can be used as a building block for an information retrieval system for digital Ottoman archives.

“…The effects of stemming and lemmatization as preprocessing operations of the input vector space model for LSA are controversial (see, e.g., Denhière & Lemaire, 2004; Kantrowitz, Mohit, & Mittal, 2000) and probably depend, on the one hand, on the quality of this type of preprocessing and, on the other hand, on the size of the corpora used. Stemming and lemmatization are different techniques that use language-dependent word morphology for the very same sought-after effect: Semantically similar words of the vocabulary are merged to create an equivalence class (the stem or the lemma), traditionally called the term, of the vector space model with less statistical noise; as a consequence of the merging, the vector space dimension is reduced.…”

Section: B Co-triggered Lemmatization (mentioning)

confidence: 99%

Effect of tuned parameters on an LSA multiple choice questions answering model

Lifchitz

Jhean-Larose

Denhière

2009

Behavior Research Methods

This article presents the current state of a work in progress, whose objective is to better understand the effects of factors that significantly influence the performance of latent semantic analysis (LSA). A difficult task, which consisted of answering (French) biology multiple choice questions, was used to test the semantic properties of the truncated singular space and to study the relative influence of the main parameters. A dedicated software was designed to fine-tune the LSA semantic space for the multiple choice questions task. With optimal parameters, the performances of our simple model were quite surprisingly equal or superior to those of seventh- and eighth-grade students. This indicates that semantic spaces were quite good despite their low dimensions and the small sizes of the training data sets. In addition, we present an original entropy global weighting of the answers' terms for each of the multiple choice questions, which was necessary to achieve the model's success.
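
The citation statements on this page describe stemming as merging inflected forms into a single term, which shrinks the vocabulary of the vector space model. The sketch below illustrates that effect on a tiny, made-up corpus. Everything in it is an assumption for illustration: `toy_stem` is a hypothetical suffix stripper and not the Porter algorithm, the three example documents are invented, and the TF-IDF weighting (raw term frequency times `log(N / df)`) is one common textbook variant, not necessarily the one used in any of the cited papers.

```python
import math
from collections import Counter

# Hypothetical toy stemmer: strips a few English suffixes.
# This is NOT the Porter algorithm, only a stand-in for illustration.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tfidf(docs, normalize=lambda w: w):
    """Return (sorted vocabulary, list of per-document TF-IDF dicts)."""
    tokenized = [[normalize(w) for w in doc.lower().split()] for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vocab, weights

# Invented example documents (assumption, not data from any cited paper).
docs = [
    "page scrolls slowly",
    "scrolling the page crashed",
    "scroll bar missing",
]

raw_vocab, raw_weights = tfidf(docs)
stem_vocab, stem_weights = tfidf(docs, toy_stem)

print(len(raw_vocab), "terms without stemming")
print(len(stem_vocab), "terms with toy stemming")
print("weight of 'scroll' in doc 0 after stemming:", stem_weights[0]["scroll"])
```

In this toy corpus, "scrolls", "scrolling", and "scroll" collapse into one term, so the vocabulary shrinks. The merged term then appears in every document, and under this weighting its inverse document frequency drops to zero. That is one way stemming can change a TF-IDF ranking rather than only compress the vocabulary, though how much it matters in practice depends on the stemmer, the language, and the corpus.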
