Not only have Self-Organizing Maps (SOMs), such as the WEBSOM, been shown to scale to very large datasets; they also allow for a novel mode of navigating through a large collection of text documents. The entire text collection is presented to the user as a regular map, where each point in the map is associated with a group of documents that are likely to be composed of similar terms and phrases. In addition, the closer two points are in the map, the more similar their associated documents are. Thus, once an interesting document is found in the map, the user just has to click around the vicinity of that document to retrieve other similar documents. A major drawback of SOMs, however, is the long training time required, especially for document collections whose volume and dimensionality are both huge. In this paper, we demonstrate how the size of the initial text collection is progressively and drastically reduced from the raw document collection to the final SOM-based text archive. We demonstrate this using a widely studied Reuters collection.
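The map-as-interface idea above can be sketched with a toy example: given a trained SOM grid of weight vectors and a mapping from units to documents, a query document is located at its best-matching unit, and similar documents are gathered from the neighboring units. The helper names, the 3x3 grid, and the unit-to-document table are illustrative assumptions, not part of the WEBSOM implementation.

```python
import numpy as np

def best_matching_unit(som, doc_vec):
    """Return (row, col) of the map unit whose weight vector is closest to doc_vec."""
    dists = np.linalg.norm(som - doc_vec, axis=2)   # som has shape (rows, cols, dim)
    return np.unravel_index(np.argmin(dists), dists.shape)

def neighborhood_docs(unit_to_docs, unit, radius=1):
    """Collect documents assigned to units within `radius` grid steps of `unit`."""
    r0, c0 = unit
    hits = []
    for (r, c), docs in unit_to_docs.items():
        if abs(r - r0) <= radius and abs(c - c0) <= radius:
            hits.extend(docs)
    return hits

# Toy 3x3 map with 4-dimensional weight vectors and a few assigned documents.
rng = np.random.default_rng(0)
som = rng.random((3, 3, 4))
unit_to_docs = {(0, 0): ["doc_a"], (0, 1): ["doc_b"], (2, 2): ["doc_c"]}

query = som[0, 0] + 0.01          # a vector very close to unit (0, 0)
unit = best_matching_unit(som, query)
print(unit, neighborhood_docs(unit_to_docs, unit))
```

Clicking "around the vicinity" of a document corresponds to enlarging `radius`, which widens the neighborhood and retrieves progressively less similar documents.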
Because Self-Organizing Maps are used mainly with data that are not pre-labeled, they need automatic procedures for extracting keywords as labels for each of the map units. The WEBSOM methodology for building very large text archives has a very slow method for extracting such unit labels: it computes the relative frequencies of all the words of all the documents associated with each unit and then compares these to the relative frequencies of all the words of all the other units of the map. Since maps may have more than 100,000 units and the archive may contain up to seven million documents, the existing WEBSOM method is not practical. This paper describes how meaningful labels per map unit can be deduced by analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method used in dimensionality reduction. The effectiveness of this technique is demonstrated on archives of the well-studied Reuters and CMV collections. Comparisons with the WEBSOM method are provided.
Keyword extraction is vital for Knowledge Management Systems, Information Retrieval Systems, and Digital Libraries, as well as for general browsing of the web. Keywords are often the basis of document processing methods such as clustering and retrieval, since processing all the words in a document can be slow. Keyword extraction is commonly automated with statistics-based methods such as Bayesian classifiers, K-Nearest Neighbors, and Expectation-Maximization. These models are limited in the word-related features they can use, since adding more features makes them more complex and harder to comprehend. In this research, a Neural Network, specifically a backpropagation network, is used to generalize the relationship between the titles and the content of articles in the archive, using word features beyond TF-IDF, such as the position of a word in the sentence, paragraph, or entire document, formatting attributes such as headings, and other attributes defined beforehand. In order to explain how the backpropagation network works, a rule extraction method is used to extract symbolic data from the resulting network. The extracted rules can then be transformed into decision trees that perform almost as accurately as the network, with the added benefit of being in an easily comprehensible format.
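Feature vectors of the kind described above might be assembled as follows. This is a hypothetical sketch: the exact feature set, names, and scaling are illustrative assumptions, not the ones used in the research; the resulting vectors would feed the backpropagation network as inputs.

```python
def word_features(word, doc, title, idf):
    """Hypothetical per-word feature vector for keyword classification:
    a TF-IDF score plus positional and formatting cues (here, title membership)."""
    words = doc.split()
    tf = words.count(word) / len(words)             # term frequency in the document
    first_pos = words.index(word) / len(words)      # relative position of first occurrence
    in_title = 1.0 if word in title.split() else 0.0
    return [tf * idf.get(word, 0.0), first_pos, in_title]

doc = "neural networks extract rules from trained networks"
idf = {"neural": 2.0, "networks": 1.0, "rules": 1.5}
features = word_features("networks", doc, "rule extraction from neural networks", idf)
print(features)
```

Each additional cue (paragraph index, heading format, capitalization) would simply extend the vector; the network, rather than a hand-tuned statistical model, learns how to weight the features jointly.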
The WEBSOM methodology, proven effective for building very large text archives, includes a method that extracts labels for each document cluster assigned to nodes in the map. However, the WEBSOM method needs to retrieve all the words of all the documents associated with each node. Since maps may have more than 100,000 nodes and since the archive may contain up to seven million documents, the WEBSOM methodology needs a faster alternative method for keyword selection. Presented here is such an alternative method that is able to quickly deduce meaningful labels per node in the map. It does this just by analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method used in dimensionality reduction. The effectiveness of this technique is demonstrated on news document collections.

Index Terms: Keyword extraction, text archives, WEBSOM, random projection.

VERY LARGE TEXT ARCHIVES

Text archives now run in the order of millions of documents. With text archives of such sizes being processed, the words that appear at least once in the text corpus, even after removal of very common (stop) words, number more than 100,000. One important methodology that is effective in archiving up to seven million text documents is the WEBSOM [1], [2], [3], [4], [5], [6], [7], which uses a "Self-Organizing Map" (SOM) at the core of its archiving technique. A number of other SOM-based text archiving techniques have been described in the literature [8], [9], [10], [11]; they differ mainly in the preprocessing and postprocessing stages of the archiving process.

WEBSOMs need automatic procedures for extracting keywords of archived documents, which are useful for browsing purposes. Extracting keywords for maps based on the WEBSOM methodology, however, is not straightforward because of the random projection method that is employed to compress the large but sparse input term frequency vectors.
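The compression step mentioned above can be illustrated with a minimal sketch. WEBSOM's own projection matrix is sparse; the Gaussian matrix below is a stand-in that demonstrates the same idea, that multiplying a high-dimensional sparse term-frequency vector by a random matrix yields a much shorter vector while approximately preserving pairwise distances (the Johnson-Lindenstrauss property). All sizes here are toy values.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab_size, reduced_dim, n_docs = 5000, 100, 3

# Sparse term-frequency vectors: each toy "document" uses only 20 of 5,000 terms.
docs = np.zeros((n_docs, vocab_size))
for d in range(n_docs):
    terms = rng.choice(vocab_size, size=20, replace=False)
    docs[d, terms] = rng.integers(1, 5, size=20)

# Random projection matrix; the 1/sqrt(reduced_dim) scaling keeps expected
# squared distances comparable before and after projection.
R = rng.normal(size=(vocab_size, reduced_dim)) / np.sqrt(reduced_dim)
reduced = docs @ R

d_orig = np.linalg.norm(docs[0] - docs[1])
d_red = np.linalg.norm(reduced[0] - reduced[1])
print(docs.shape, "->", reduced.shape)
print(f"pairwise distance before: {d_orig:.1f}, after: {d_red:.1f}")
```

The distance after projection is close to, but not exactly, the original; it is this approximate preservation that lets the SOM be trained on 100-dimensional vectors instead of 100,000-dimensional ones, and it is the structure of the projection that the keyword extraction method in this paper exploits.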
The WEBSOM methodology does include an automatic keyword extraction procedure [6], which we refer to as the "Lagus method." The procedure, however, is rather slow: it computes the relative frequencies of all the words of all the documents associated with each node and then compares these to the relative frequencies of words of the other nodes in the map. Since maps may have more than 100,000 nodes and the archive may contain up to seven million documents, the existing method is not practical. Another keyword extraction method has been reported [8], but this was not done on input vectors that have gone through random projection.

The alternative method we describe in this paper deduces the keywords by just analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method. Our method is several orders of magnitude faster than the current WEBSOM technique, and yet retrieves essentially the same keywords as the WEBSOM method. Keyword selection is important because it allows SOMs to be used as a novel interface for navigating large document collections.
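The frequency-contrast scheme described above can be sketched as follows. This is a rough stand-in for the Lagus method, not its exact formula: a word scores highly for a node when its relative frequency among the node's documents greatly exceeds its relative frequency in the rest of the archive. Note that it must touch every word of every document, which is what makes the approach slow at archive scale.

```python
from collections import Counter

def node_keywords(node_docs, other_docs, top_k=3):
    """Rank candidate labels for one map node by contrasting word frequencies
    in the node's documents against the rest of the archive."""
    node_counts = Counter(w for doc in node_docs for w in doc.split())
    other_counts = Counter(w for doc in other_docs for w in doc.split())
    node_total = sum(node_counts.values())
    other_total = sum(other_counts.values()) or 1
    scores = {
        # Small epsilon avoids division by zero for words unique to the node.
        w: (c / node_total) / (other_counts[w] / other_total + 1e-9)
        for w, c in node_counts.items()
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

node = ["oil prices rise", "oil exports fall"]
rest = ["stocks rise", "bond prices fall", "stocks rally"]
print(node_keywords(node, rest))
```

Words unique to the node ("oil", "exports") dominate the ranking, which matches the intuition that good unit labels are terms distinctive to that cluster. The method of this paper reaches similar labels without the per-document pass, by reading them off the SOM weight vectors directly.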
In this paper, we present an approach to sample selection using an ensemble of neural networks for credit scoring. The ensemble identifies samples that can be considered outliers by checking the classification accuracy of the neural networks on the original training samples. Samples that are consistently misclassified by the neural networks in the ensemble are removed from the training dataset. The remaining samples are then used to train and prune another neural network for rule extraction. Our experimental results on publicly available benchmark credit scoring datasets show that by eliminating the outliers, we obtain neural networks with higher predictive accuracy and simpler structure than networks trained with the original dataset. A rule extraction algorithm is applied to generate comprehensible rules from the neural networks. The extracted rules are more concise than the rules generated from networks trained on the original datasets.
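The filtering step above can be sketched on toy data. As a stand-in for the neural networks, this sketch uses plain logistic regression, and as a stand-in for resampling, each ensemble member is trained with a different sample held out; the real approach would use trained networks and its own ensemble construction. Samples misclassified by every member are treated as outliers and dropped before the final model is trained.

```python
import numpy as np

def train_logistic(X, y, epochs=200, lr=0.5):
    """Logistic regression by batch gradient descent (stand-in for a neural net)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    return (X @ w > 0).astype(int)

# Toy 1-D "credit" data: a positive feature means good risk; the last two
# samples are mislabeled outliers that contradict the overall pattern.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0], [-1.8], [1.8]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Ensemble of five models, each trained with one sample held out; count how
# many members misclassify each training sample.
misses = np.zeros(len(y), dtype=int)
for i in range(5):
    mask = np.arange(len(y)) != i
    w = train_logistic(X[mask], y[mask])
    misses += predict(w, X) != y

# Remove samples every member got wrong, then retrain on the cleaned data.
keep = misses < 5
w_clean = train_logistic(X[keep], y[keep])
print("removed sample indices:", np.flatnonzero(~keep))
```

Only the two mislabeled samples are consistently misclassified, so they are the ones removed; the model retrained on the cleaned data then fits the remaining samples perfectly, mirroring the accuracy and simplicity gains reported in the abstract.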