Jr. Yap Tn scite author profile

Jr. Yap Tn

1 Publication

5 Citation Statements Received

24 Citation Statements Given

Publications

Evaluating keyword selection methods for websom text archives

Azcarraga, Tn, Tan, et al. 2004

IEEE Trans. Knowl. Data Eng.

The WEBSOM methodology, proven effective for building very large text archives, includes a method that extracts labels for each document cluster assigned to nodes in the map. However, the WEBSOM method needs to retrieve all the words of all the documents associated to each node. Since maps may have more than 100,000 nodes and since the archive may contain up to seven million documents, the WEBSOM methodology needs a faster alternative method for keyword selection. Presented here is such an alternative method that is able to quickly deduce meaningful labels per node in the map. It does this just by analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method used in dimensionality reduction. The effectiveness of this technique is demonstrated on news document collections.

Index Terms: Keyword extraction, text archives, WEBSOM, random projection.

VERY LARGE TEXT ARCHIVES

Text archives now run in the order of millions of documents. With such sizes of text archives being processed, words that appear at least once in the text corpus, even after removal of very common (stop) words, run to more than 100,000! One important methodology that is effective in archiving up to seven million text documents is the WEBSOM [1], [2], [3], [4], [5], [6], [7], which uses a "Self-Organizing Map" (SOM) at the core of its archiving technique. A number of other SOM-based text archiving techniques have been described in the literature [8], [9], [10], [11]. They differ mainly in the preprocessing and postprocessing stages of the archiving process.

WEBSOMs need automatic procedures for extracting keywords of archived documents which are useful for browsing purposes. Extracting keywords for maps based on the WEBSOM methodology, however, is not straightforward because of a random projection method that is employed to compress the large but sparse input term frequency vectors.
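
To make the compression step mentioned above concrete, here is a minimal sketch of one sparse form of random projection, in which every vocabulary word is mapped to a few randomly chosen dimensions of a much smaller vector. The excerpt does not say which variant the paper uses, so the function names (make_projection, project), the ones_per_word parameter, and the sizes are all illustrative assumptions, not the authors' implementation.

```python
import random

def make_projection(vocab_size, target_dim, ones_per_word=3, seed=0):
    """Assumed sparse scheme: each word gets `ones_per_word` distinct
    random target dimensions (a sparse 0/1 column of the projection)."""
    rng = random.Random(seed)
    return [rng.sample(range(target_dim), ones_per_word)
            for _ in range(vocab_size)]

def project(term_freqs, projection, target_dim):
    """Compress a sparse term-frequency vector.

    term_freqs maps word index -> count; the result is a dense list of
    length target_dim in which each word's count is added to every
    dimension assigned to that word.
    """
    out = [0.0] * target_dim
    for word, count in term_freqs.items():
        for dim in projection[word]:
            out[dim] += count
    return out

# Hypothetical sizes: a 1,000-word vocabulary squeezed into 50 dimensions.
proj = make_projection(vocab_size=1000, target_dim=50)
compressed = project({5: 2, 17: 1}, proj, target_dim=50)
```

Because several words can share a target dimension, a value in the compressed vector no longer identifies a single word, which is why reading keywords back out of such vectors is not straightforward.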
The WEBSOM methodology does include an automatic keyword extraction procedure [6], which we refer to as the "Lagus method." The procedure, however, is rather slow. It computes the relative frequencies of all the words of all the documents associated to each node and then compares these to the relative frequencies of words of the other nodes in the map. Since maps may have more than 100,000 nodes and the archive may contain up to seven million documents, the existing method is not practical. Another keyword extraction method has been reported [8], but this was not done on input vectors that have gone through random projection.

The alternative method we describe in this paper deduces the keywords by just analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method. Our method is several orders of magnitude faster than the current WEBSOM technique, and yet retrieves fairly the same keywords as the WEBSOM method.

Keyword selection is important because it allows SOMs to be used as a novel interface for nav...
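
The relative-frequency comparison described above for the existing procedure can be sketched roughly as follows. This is only an illustration of the general idea: the excerpt does not give the actual formula, so the scoring rule used here (a word's relative frequency in a node, weighted by that node's share of the word across all nodes) and the names relative_freqs and select_keywords are assumptions.

```python
from collections import Counter

def relative_freqs(docs):
    """docs: list of token lists for one node.
    Returns word -> share of all tokens in that node."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def select_keywords(node_docs, top_k=2):
    """node_docs: node id -> list of token lists.
    Returns node id -> top_k candidate labels."""
    rel = {node: relative_freqs(docs) for node, docs in node_docs.items()}
    # Sum each word's relative frequency over every node in the map.
    totals = Counter()
    for freqs in rel.values():
        for word, f in freqs.items():
            totals[word] += f
    labels = {}
    for node, freqs in rel.items():
        # Assumed score: frequent in this node AND rare in the other nodes.
        scores = {w: f * f / totals[w] for w, f in freqs.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        labels[node] = ranked[:top_k]
    return labels

# Toy map with two nodes (made-up documents).
toy = {
    "a": [["oil", "price", "market"], ["oil", "opec"]],
    "b": [["match", "goal", "market"], ["goal", "team"]],
}
labels = select_keywords(toy, top_k=1)
```

Even in this toy form, every token of every document attached to every node has to be visited, which is the cost the abstract points to for maps with more than 100,000 nodes and millions of documents.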
