This paper introduces a new algorithm for clustering data in high-dimensional feature spaces, called GARDENHD. The algorithm is organized around the notion of data space reduction, i.e. the process of detecting dense areas (dense cells) in the space. It performs effective and efficient elimination of empty areas that characterize typical high-dimensional spaces and an efficient adjacency-connected agglomeration of dense cells into larger clusters. It produces a compact representation that can effectively capture the essence of data. GARDENHD is a hybrid of cell-based and density-based clustering. However, unlike typical clustering methods in its class, it applies a recursive partition of sparse regions in the space using a new space-partitioning strategy. The properties of this partitioning strategy greatly facilitate data space reduction. The experiments on synthetic and real data sets reveal that GARDENHD and its data space reduction are effective, efficient, and scalable.
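The two steps named in this abstract, detecting dense cells and merging adjacent dense cells into clusters, can be illustrated with a minimal sketch. It assumes a uniform grid in place of GARDENHD's own recursive space-partitioning strategy, which the abstract does not specify; `grid_cluster`, `cell_size`, and `min_points` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict, deque


def grid_cluster(points, cell_size, min_points):
    """Label dense grid cells with cluster ids (illustrative only).

    Assumption: a uniform grid stands in for the paper's partitioning
    strategy, and two cells are adjacent when they differ by one step
    along exactly one axis.
    """
    # Step 1: bucket points into cells; cells with no points are never
    # materialized, so empty regions cost nothing.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)

    # Step 2: keep only cells that meet the density threshold.
    dense = {k for k, members in cells.items() if len(members) >= min_points}

    # Step 3: agglomerate adjacent dense cells with a breadth-first search.
    labels = {}
    cluster_id = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = cluster_id
                        queue.append(nb)
        cluster_id += 1
    return labels
```

The returned dictionary maps each dense cell to a cluster id; sparse cells are simply absent, which is the sense in which the output is a compact representation of the data.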
Broadcast is a scalable way of disseminating data because broadcasting an item satisfies all outstanding client requests for it. However, because the transmission medium is shared, individual requests may have high response times. In this paper, we show how to minimize the average response time given multiple broadcast channels by optimally partitioning data among them. We also offer an approximation algorithm that is less complex than the optimal algorithm and show that its performance is near-optimal for a wide range of parameters. Finally, we briefly discuss the extensibility of our work with two simple, yet seldom researched, extensions: handling items of varying size and generating single-channel schedules.
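One way to picture the partitioning problem is the sketch below. It is not the paper's algorithm: the abstract does not state the cost model, so the sketch assumes equal-sized items, a flat cyclic schedule on each channel (expected wait of half the cycle length), and a search restricted to contiguous runs of the popularity-sorted item list. The names `best_partition`, `probs`, and `k` are hypothetical.

```python
def best_partition(probs, k):
    """Split items across k channels to minimize expected wait (sketch).

    Assumptions (not taken from the abstract): items have equal size,
    each channel cycles through its items in a flat schedule, and a
    channel holding m items gives an expected wait of m / 2 item slots.
    Requires 1 <= k <= len(probs).
    """
    probs = sorted(probs, reverse=True)
    n = len(probs)

    # prefix[j] is the total access probability of the first j items.
    prefix = [0.0]
    for p in probs:
        prefix.append(prefix[-1] + p)

    inf = float("inf")
    # cost[c][j]: minimum expected wait with the first j items on c channels.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                # Items i..j-1 share the last channel.
                cand = cost[c - 1][i] + (j - i) / 2 * (prefix[j] - prefix[i])
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    cut[c][j] = i

    # Walk the cut points backwards to recover the channel sizes.
    sizes = []
    j = n
    for c in range(k, 0, -1):
        i = cut[c][j]
        sizes.append(j - i)
        j = i
    sizes.reverse()
    return cost[k][n], sizes
```

For example, `best_partition([0.4, 0.3, 0.2, 0.1], 2)` compares the three contiguous splits and picks two items per channel. The triple loop runs in O(k n^2) time, which gives some intuition for why a cheaper approximation algorithm is attractive for large item sets.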
To avoid the high cost of continuous connectivity, a class of mobile applications employs replicas of shared data that are periodically updated. Updates to these replicas are typically performed on a client-by-client basis (that is, the server individually computes and transmits updates to each client), which limits scalability. By basing updates on replica groups instead of clients, however, update generation complexity is no longer bound by client population size. Clients then download the updates of the groups pertinent to them. Proper group design reduces redundancies in server processing, disk usage, and bandwidth usage, and diminishes the tie between the complexity of updating replicas and the size of the client population. In this paper, we expand on previous work on group design, include a detailed I/O cost model for update generation, and propose a heuristic-based greedy algorithm for group computation. Experimental results with an adapted commercial replication system demonstrate a significant increase in overall scalability over the client-centric approach.
Figure 1: The update server maintains the primary copy and distributes updates on demand to intermittently connected clients that maintain replicas.
We introduce IR-Wire, a tool for information retrieval research and education in peer-to-peer file-sharing systems. Built on top of LimeWire's implementation of the popular Gnutella standard, it includes functionality to collect data on queries and shared files and stores the data in a form that simplifies analysis. IR-Wire has a modular design to facilitate its customization for other uses.
Peer-to-peer file-sharing systems commonly use the set-of-terms model (the union of the terms in the shared files) to succinctly describe a peer's shared files. This information is shared with neighbors, who use it to guide query routing decisions. The problem with this model, however, is that it falsely suggests term co-occurrences that do not exist in any single file. Consequently, queries get routed erroneously to peers that have no matching files, wasting network and computation resources in the process. We reduce the number of co-occurrence errors by partitioning each peer's file set and representing the peer as several file partitions instead of one. Experimental evidence demonstrates that it is possible to reduce the network traffic between neighbors by over 50% at virtually no cost.
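The false co-occurrence problem can be seen in a small sketch. It assumes that a query matches a file only when every query term appears in that file's name, and that a neighbor forwards a query whenever some summary contains all query terms. The helper names and the hand-made partitioning are hypothetical; the abstract does not say how partitions are chosen.

```python
def term_set(files):
    """Union of the terms of all files: the set-of-terms summary."""
    terms = set()
    for name in files:
        terms |= set(name.lower().split())
    return terms


def summary_admits(query, summaries):
    """True if any summary contains every query term, so the query is routed."""
    q = set(query.lower().split())
    return any(q <= s for s in summaries)


def any_file_matches(query, files):
    """True if a single file contains every query term (a real hit)."""
    q = set(query.lower().split())
    return any(q <= set(name.lower().split()) for name in files)


files = ["blue moon", "red river", "blue sky", "red sun"]
query = "blue river"

# One summary for the whole peer: "blue" and "river" both appear,
# although no single file contains both terms.
single = [term_set(files)]

# Hypothetical partitioning into two groups of files, one summary each.
partitions = [["blue moon", "blue sky"], ["red river", "red sun"]]
split = [term_set(part) for part in partitions]

routed_with_single = summary_admits(query, single)
routed_with_split = summary_admits(query, split)
real_hit = any_file_matches(query, files)
```

With the single summary the query is routed to the peer even though no file matches; with the two partition summaries it is not, at the price of advertising two term sets instead of one.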