1997
DOI: 10.1145/263868.263871

A blueprint for automatic indexing

Abstract: This note summarizes some of the currently available insights in automatic indexing. The emphasis is on aspects that are expected to be useful in practical automatic indexing applications. The discussion is necessarily cursory, but the references will lead interested readers to a deeper treatment of the indexing problem.

Cited by 6 publications (5 citation statements)
References 0 publications
“…Borko [6] advocates the use of pre-selected terms to represent each document. In contrast, [7,9,14,16,18,20,23] extract all the terms from the documents to act as content representatives, albeit sometimes with some filtering, such as using only the n highest-weighted terms [7]. In the majority of cases, stop word lists are used to remove words that occur too frequently (e.g., "the", "of", "and"), and suffix stripping routines (such as Porter's stemming routine [24]) are used to reduce words to their stems [7,14,16,18–20,22,23,25].…”
Section: Introduction
confidence: 93%
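The preprocessing pipeline the statement above describes, stop-word removal followed by suffix stripping, can be sketched as follows. The stemmer here is a deliberately simplified suffix stripper standing in for a full algorithm such as Porter's; the stop-word and suffix lists are illustrative, not from the cited works.

```python
# Simplified indexing pre-processing: stop-word removal plus naive
# suffix stripping (a stand-in for a full stemmer such as Porter's).
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of >= 3 characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, and reduce the remaining words to stems."""
    words = [w for w in text.lower().split() if w.isalpha()]
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]
```

For example, `index_terms("the indexing of documents")` reduces to the stems `["index", "document"]`, the term list that would be handed to the weighting stage.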
“…The importance of a term as a representative of the document's content may be calculated using the Inverse Document Frequency (IDF), which weights a term inversely to the number of documents in the collection that contain it, so that terms concentrated in few documents receive high weights [18]. Gulli [12] calculates the term weights by utilising the TF.IDF measure computed over the DMOZ (Open Directory Project) categories.…”
Section: Introduction
confidence: 99%
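One common instantiation of the TF.IDF weight mentioned above is shown below; this is a generic variant (relative term frequency times log-inverse document frequency), not necessarily the exact formula used in the cited works.

```python
import math

def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """TF.IDF weight = term frequency x inverse document frequency.

    term_count: occurrences of the term in this document
    doc_length: total term occurrences in this document
    num_docs:   number of documents in the collection
    doc_freq:   number of documents containing the term (> 0)
    """
    tf = term_count / doc_length
    idf = math.log(num_docs / doc_freq)
    return tf * idf
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing as a content discriminator, which is exactly the behaviour the statement above describes.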
“…Similarity scores between two records are computed using a cosine measure applied in conjunction with a vector space document representation [3]. In the vector space representation, the textual content of each record is represented as a "bag of words", which tallies the frequency of terms used in a record but does not account for word order, sentence structure, or semantic features of the content.…”
Section: Computing Similarity
confidence: 99%
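A minimal sketch of the bag-of-words representation and the cosine measure between two records, assuming whitespace tokenisation and raw term counts:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors.

    Word order and sentence structure are ignored: only term
    frequencies contribute, as in the vector space model.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two records with identical term tallies score 1.0 regardless of word order, while records sharing no terms score 0.0.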
“…Each record can be represented as an n-dimensional vector, where n is the total number of distinct terms occurring in the record corpus. The value of each dimension of the record's vector can be assigned a weight, for instance by using the TF-IDF formula [3] (see Eq. (1)), which employs term frequency (TF) and inverse document frequency (IDF).…”
Section: Computing Similarity
confidence: 99%
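The n-dimensional TF-IDF vectors described above could be built roughly as follows; this is a sketch under the standard TF x log(N/df) weighting, and the exact variant in Eq. (1) of the cited work may differ.

```python
import math
from collections import Counter

def tfidf_vectors(records):
    """Map each record (a list of terms) to an n-dimensional TF-IDF vector,
    where n is the number of distinct terms in the record corpus."""
    vocab = sorted({t for rec in records for t in rec})
    n_docs = len(records)
    # Document frequency: how many records contain each term.
    df = {t: sum(t in rec for rec in records) for t in vocab}
    vectors = []
    for rec in records:
        counts = Counter(rec)
        vectors.append([
            (counts[t] / len(rec)) * math.log(n_docs / df[t])
            for t in vocab
        ])
    return vocab, vectors
```

Terms that occur in every record receive weight 0 in all vectors, and the resulting vectors can be fed directly to a cosine measure to score record similarity.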
“…First, the feature terms are extracted by applying the techniques from the "Classic Blueprint for Automatic Indexing" [15]. This forms a document set D = (D1, D2, …, Dn).…”
Section: Concept Lattice Based User Profile
confidence: 99%