Proceedings of the Tenth International Conference on Information and Knowledge Management 2001
DOI: 10.1145/502585.502593
Extracting meaningful labels for WEBSOM text archives

Abstract: Self-Organizing Maps, being used mainly with data that are not pre-labeled, need automatic procedures for extracting keywords as labels for each of the map units. The WEBSOM methodology for building very large text archives has a very slow method for extracting such unit labels. It computes the relative frequencies of all the words of all the documents associated with each unit and then compares these to the relative frequencies of all the words of all the other units of the map. Since maps may have more than 10…
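The labeling procedure the abstract describes — scoring words by their relative frequency within a unit's documents against their relative frequency in the rest of the map — can be sketched roughly as follows. This is an illustrative reconstruction, not the WEBSOM implementation; the function name `unit_labels` and the ratio-based score are assumptions.

```python
from collections import Counter

def unit_labels(unit_docs, top_k=3):
    """Hypothetical sketch of WEBSOM-style unit labeling: for each map
    unit, score words by their relative frequency in that unit's
    documents divided by their relative frequency in all other units'
    documents, and keep the top-scoring words as labels."""
    def rel_freq(docs):
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values()) or 1
        return {w: c / total for w, c in counts.items()}

    labels = {}
    for unit, docs in unit_docs.items():
        here = rel_freq(docs)
        # pooled documents of every *other* unit — this is the slow part
        # the paper criticizes: it is recomputed for each unit.
        elsewhere = rel_freq(
            [d for u, ds in unit_docs.items() if u != unit for d in ds]
        )
        scored = {w: f / (elsewhere.get(w, 0.0) + 1e-9) for w, f in here.items()}
        labels[unit] = [w for w, _ in
                        sorted(scored.items(), key=lambda x: -x[1])[:top_k]]
    return labels
```

Because the score for every word in every unit depends on the pooled statistics of all other units, the cost grows with both vocabulary size and map size, which is why the abstract calls the method very slow for maps with many units.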

Cited by 9 publications (12 citation statements); references 13 publications (5 reference statements).
“…One important methodology that is effective in archiving up to seven million text documents is the WEBSOM [1], [2], [3], [4], [5], [6], [7], which uses a "Self-Organizing Map" (SOM) at the core of its archiving technique. A number of other SOM-based text archiving techniques have been described in the literature [8], [9], [10], [11]. They differ mainly in the preprocessing and postprocessing stages of the archiving process.…”
Section: Very Large Text Archives
Mentioning confidence: 99%
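The Self-Organizing Map at the core of the archiving technique cited above can be illustrated with a minimal sketch: document vectors are mapped onto a small 2-D grid, and each grid unit's weight vector is pulled toward documents for which it (or a neighbor) is the best match. This is a generic SOM under assumed hyperparameters, not the WEBSOM system itself.

```python
import numpy as np

def train_som(doc_vectors, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal illustrative SOM: fit a grid of weight vectors to
    high-dimensional document vectors with a Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = doc_vectors.shape[1]
    weights = rng.random((h, w, dim))
    # grid coordinates of every unit, shape (h, w, 2)
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)            # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.1  # shrinking neighborhood
        for x in doc_vectors:
            # best-matching unit: grid cell whose weights are closest to x
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood centered on the BMU
            dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
            nb = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            weights += lr * nb * (x - weights)
    return weights
```

After training, each document is assigned to its best-matching unit, and nearby units hold similar documents — the property the WEBSOM methodology exploits for browsing very large archives.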
“…Our experiment on the trained SOM, using both the entire news collection and the subset used here, was reported in [10]. Details about the extracted keywords for the Reuters subset can be found in [11]. The assessment of the quality of the keywords extracted from both this news collection and the CNN collection is presented in the next section.…”
Section: Extracting Abstract For News Archives
Mentioning confidence: 99%
“…Other approaches, statistical and computational in nature, use data-driven machine learning algorithms to distinguish keywords from non-keywords; these include Genetic Algorithms, Support Vector Machines, Decision Trees, Self-Organizing Maps, and Artificial Neural Networks [17], [23].…”
Section: Introduction
Mentioning confidence: 99%
“…This is true especially for very large archives with millions or more articles. Processing all the words in the documents as if they were of equal importance, as the basis for finding relevant articles, would be slow and impractical [1]. That is why it is important to have a set of good keywords that represent the actual contents of the document.…”
Section: Introduction
Mentioning confidence: 99%