1997
DOI: 10.1145/263868.263871

A blueprint for automatic indexing

Abstract: This note summarizes some of the currently available insights in automatic indexing. The emphasis is on aspects that are expected to be useful in practical automatic indexing applications. The discussion is necessarily cursory, but the references will lead interested readers to a deeper treatment of the indexing problem.

Cited by 6 publications (5 citation statements)
References 0 publications
“…Borko [6] advocates the use of pre-selected terms to represent each document. In contrast, [7,9,14,16,18,20,23] extract all the terms from the documents to act as content representatives, albeit sometimes with some filtering, such as using only the n highest-weighted terms [7]. In the majority of cases, stop word lists are used to remove words that occur too frequently (e.g., "the", "of", "and"), and suffix stripping routines (such as Porter's stemming routine [24]) are used to reduce words to their stems [7,14,16,18–20,22,23,25].…”
Section: Introduction
confidence: 93%
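The preprocessing pipeline the statement above describes, stop-word removal followed by suffix stripping, can be sketched as follows. The stemmer here is a deliberately simplified suffix stripper standing in for a full algorithm such as Porter's; the stop-word and suffix lists are illustrative, not from the cited works.

```python
# Simplified indexing pre-processing: stop-word removal plus naive
# suffix stripping (a stand-in for a full stemmer such as Porter's).
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of >= 3 characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, and reduce the remaining words to stems."""
    words = [w for w in text.lower().split() if w.isalpha()]
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]
```

For example, `index_terms("the indexing of documents")` reduces to the stems `["index", "document"]`, the term list that would be handed to the weighting stage.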
“…The importance of a term as a representative of the document's content may be calculated using the Inverse Document Frequency (IDF), which weights a term inversely to the number of documents in the collection that contain it, so that terms concentrated in few documents receive high weights [18]. Gulli [12] calculates the term weights by utilising the TF.IDF measure computed over the DMOZ (Open Directory Project) categories.…”
Section: Introduction
confidence: 99%
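One common instantiation of the TF.IDF weight mentioned above is shown below; this is a generic variant (relative term frequency times log-inverse document frequency), not necessarily the exact formula used in the cited works.

```python
import math

def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """TF.IDF weight = term frequency x inverse document frequency.

    term_count: occurrences of the term in this document
    doc_length: total term occurrences in this document
    num_docs:   number of documents in the collection
    doc_freq:   number of documents containing the term (> 0)
    """
    tf = term_count / doc_length
    idf = math.log(num_docs / doc_freq)
    return tf * idf
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing as a content discriminator, which is exactly the behaviour the statement above describes.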
“…Similarity scores between two records are computed using a cosine measure applied in conjunction with a vector space document representation [3]. In the vector space representation, the textual content of each record is represented as a "bag of words", which tallies the frequency of terms used in a record but does not account for word order, sentence structure, or semantic features of the content.…”
Section: Computing Similarity
confidence: 99%
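A minimal sketch of the bag-of-words representation and the cosine measure between two records, assuming whitespace tokenisation and raw term counts:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors.

    Word order and sentence structure are ignored: only term
    frequencies contribute, as in the vector space model.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two records with identical term tallies score 1.0 regardless of word order, while records sharing no terms score 0.0.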
“…Each record can be represented as an n-dimensional vector, where n is the total number of distinct terms occurring in the record corpus. The value of each dimension of the record's vector can be assigned a weight, for instance by using the TF-IDF formula [3] (see Eq. (1)), which employs term frequency (TF) and inverse document frequency (IDF).…”
Section: Computing Similarity
confidence: 99%
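The n-dimensional TF-IDF vectors described above could be built roughly as follows; this is a sketch under the standard TF x log(N/df) weighting, and the exact variant in Eq. (1) of the cited work may differ.

```python
import math
from collections import Counter

def tfidf_vectors(records):
    """Map each record (a list of terms) to an n-dimensional TF-IDF vector,
    where n is the number of distinct terms in the record corpus."""
    vocab = sorted({t for rec in records for t in rec})
    n_docs = len(records)
    # Document frequency: how many records contain each term.
    df = {t: sum(t in rec for rec in records) for t in vocab}
    vectors = []
    for rec in records:
        counts = Counter(rec)
        vectors.append([
            (counts[t] / len(rec)) * math.log(n_docs / df[t])
            for t in vocab
        ])
    return vocab, vectors
```

Terms that occur in every record receive weight 0 in all vectors, and the resulting vectors can be fed directly to a cosine measure to score record similarity.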
“…First, the feature terms are extracted by applying the techniques from the "Classic Blueprint for Automatic Indexing" [15]. This forms a document set D = (D1, D2, …, Dn).…”
Section: Concept Lattice Based User Profile
confidence: 99%