Figure 1: Island-like visualization of a document point cloud's topological structure. Documents that share similar dimensions accumulate in subspaces of the high-dimensional information space. Because dimensions correspond to words, clusters are assumed to describe topics, which appear as islands in the final visualization.
ABSTRACT

Over the last decades, electronic textual information has become the world's largest and most important information source. People have added a variety of daily newspapers, books, scientific and governmental publications, blogs, and private messages to this wellspring of endless information and knowledge. Since neither the existing nor the new information can be read in its entirety, computers are used to extract and visualize meaningful or interesting topics and documents from this huge information clutter.

In this paper, we extend, improve, and combine existing individual approaches into an overall framework that supports topological analysis of high-dimensional document point clouds given by the well-known tf-idf document-term weighting method. We show that traditional distance-based approaches fail in very high-dimensional spaces, and we describe an improved two-stage method for topology-based projections from the original high-dimensional information space to both two-dimensional (2-D) and three-dimensional (3-D) visualizations. To show the accuracy and usability of this framework, we compare it to recently introduced methods and apply it to complex document and patent collections.
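For readers unfamiliar with tf-idf, the following minimal sketch illustrates how a small document collection can be turned into a point cloud in which each dimension is a word. It is an illustration only and not the implementation used in this work: the function name `tfidf`, the toy documents, and the specific variant (raw term count multiplied by the natural logarithm of N/df) are assumptions, and other tf-idf variants differ in smoothing and normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, vectors) for a list of tokenized documents.

    Assumed variant: weight = raw term count * ln(N / df), where N is the
    number of documents and df the number of documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # One dimension per distinct word, in a fixed (sorted) order.
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical toy collection; each document becomes one point.
docs = [
    "topology of document point clouds".split(),
    "patent document collections".split(),
    "topology based projection of point clouds".split(),
]
vocab, vectors = tfidf(docs)
```

Each entry of `vectors` is one document's coordinates in a space with `len(vocab)` dimensions. Even for this toy collection most coordinates are zero, which hints at how sparse and high-dimensional the point cloud becomes for realistic vocabularies.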