A probabilistic learning approach for document indexing

Fuhr, Norbert; Buckley, Chris

doi:10.1145/125187.125189

Cited by 146 publications

(102 citation statements)

References 16 publications

“…A wide variety of learning approaches have been applied to TC, to name a few, Bayesian classification (Lewis and Ringuette 1994;Domingo and Pazzani 1996;Larkey and Croft 1996;Koller and Sahami 1997;Lewis 1998), decision trees (Weiss, Apte et al ;Fuhr and Buckley 1991;Cohen and Hirsh 1998;Li and Jain 1998), decision rule classifiers such as CHARADE (Moulinier and Ganascia 1996), or DL-ESC (Li and Yamanishi 1999), or RIPPER (Cohen and Hirsh 1998), or SCAR (Moulinier, Raskinis et al 1996), or SCAP-1 (Apté, Damerau et al 1994), multi-linear regression models (Yang and Chute 1994;Yang and Liu 1999), Rocchio method (Hull 1994;Ittner, Lewis et al 1995;Sable and Hatzivassiloglou 2000), Neural Networks (Schütze, Hull et al 1995;Wiener, Pedersen et al 1995;Dagan, Karov et al 1997;Ng, Goh et al 1997;Lam and Lee 1999;Ruiz and Srinivasan 1999), example based classifiers (Creecy 1991;Masand, Linoff et al 1992;Larkey 1999), support vector machines (Joachims 1998), Bayesian inference networks (Tzeras and Hartmann 1993;Wai and Fan 1997;Dumais, Platt et al 1998), genetic algorithms (Masand 1994;Clack, Farringdon et al 1997), and maximum entropy modelling (Manning and Schütze 1999).…”

Section: Machine Learning Approaches To Text Categorization (mentioning)

confidence: 99%

“…combining many words as one index, for example "artificial intelligence" or "data mining" (Fuhr and Buckley 1991;Tzeras and Hartmann 1993;Schütze, Hull et al 1995). These indexes can be generated either manually or automatically.…”

Section: Indexing (mentioning)

confidence: 99%

“…Among these are the DIA association factor (Fuhr and Buckley 1991), chi-square (Yang and Pedersen 1997;Sebastiani, Sperduti et al 2000;Caropreso, Matwin et al 2001), NGL coefficient (Ng, Goh et al 1997;Ruiz and Srinivasan 1999), information gain (Lewis and Ringuette 1994;Moulinier, Raskinis et al 1996;Yang and Pedersen 1997;Larkey 1998;Mladenic and Grobelnik 1998;Caropreso, Matwin et al 2001), mutual information (Larkey and Croft 1996;Wai and Fan 1997;Dumais, Platt et al 1998;Taira and Haruno 1999), odds ratio (Mladenic and Grobelnik 1998;Ruiz and Srinivasan 1999;Caropreso, Matwin et al 2001), relevancy score (Wiener, Pedersen et al 1995) and GSS coefficient (Galavotti, Sebastiani et al 2000). Three of the most popular methods are described briefly below.…”

Section: This Leads To the Term Frequency/inverse Document Frequency (mentioning)

confidence: 99%

Text and Hypertext Categorization

Benbrahim

Bramer

2009

Lecture Notes in Computer Science

Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.

“…• DIAAF: The Darmstadt Indexing Approach (DIA) [11] was originally "developed for automatic indexing with a prescribed indexing vocabulary" [12]. In a machine learning context, Sebastiani [23] argues that this approach "considers properties (of terms, documents, categories, or pairwise relationships among these) as basic dimensions of the learning space".…”

Section: Statistical Feature Selection (mentioning)

confidence: 99%

A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification

Wang

Coenen

Sanderson

2009

Advanced Data Mining and Applications

Abstract. Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure, where the most significant text features that serve to differentiate between text categories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (i) linguistic, (ii) statistical, and (iii) hybrid, combining (i) and (ii). With regard to language-independent TC, our study relates to the statistical aspect only. Textual data pre-processing includes Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing statistical FS techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti-Sebastiani-Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to the accuracy of classification.

“…Document indexing is defined as the task of assigning terms to documents for retrieval purposes [11]. The process consists of two generic steps: extracting the subject matter of a document, and expressing the subject matter in index terms to facilitate subject retrieval [12].…”

Section: Introduction (mentioning)

confidence: 99%

Source code indexing for automated tracing

Mahmoud

Niu

2011

Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering

Requirements-to-source-code traceability employs information retrieval (IR) methods to automatically link requirements to the source code that implements them. A crucial step in this process is indexing, where partial and important information from the software artifacts is converted into a representation that is compatible with the underlying IR model. Source code demands special attention in the indexing process. In this paper, we investigate source code indexing for supporting automatic traceability. We introduce a feature diagram that captures the key components and their relationships in the domain of source code indexing. We then present an experiment to examine the features of the diagram and their dependencies. Results show that utilizing comments has a significant effect on traceability link generation, and stemming is required when comments are considered.
