Abstract: Cooperative Web caching is the most common solution for augmenting the low cache hit rates of single proxies. However, both purely hierarchical and flat architectures suffer from scalability problems due to cooperation protocol overheads. We present a new cooperative architecture that organizes cache servers in well-connected clusters and implements a novel cooperation model based on a two-tier lookup process. Experimental results obtained on a working prototype show that the proposed arch…
“…The approach is reminiscent of cooperative caching [18], cooperative web-caching [19], and peer-to-peer storage systems [17]. (Other data-aware scheduling approaches tend to assume static resources [1,2].)…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…Data diffusion thus involves a combination of dynamic resource provisioning, data caching, and data-aware scheduling. The approach is reminiscent of cooperative caching [18], cooperative web-caching [19], and peer-to-peer storage systems [17]. (Other data-aware scheduling approaches tend to assume static resources [1,2].)…”
Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. The approach is reminiscent of cooperative caching, web-caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive and/or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both microbenchmarks and a large-scale astronomy application demonstrate that our approach improves performance relative to alternative approaches and provides improved scalability, as aggregated I/O bandwidth scales linearly with the number of data cache nodes.
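The scheduling idea in this abstract (run a task where its data is already cached, and replicate data to a new node when demand arrives there) can be illustrated with a minimal sketch. This is not Falkon's API or the authors' policy: the class `DataAwareScheduler`, its methods, and the node and file names are all hypothetical, and the "always prefer a cache holder" rule is a deliberate simplification that ignores load on the holder.

```python
from collections import defaultdict


class DataAwareScheduler:
    """Toy dispatcher (hypothetical, not Falkon): prefer nodes caching the file."""

    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}   # number of tasks assigned per node
        self.cache = defaultdict(set)       # file name -> nodes holding a copy

    def add_node(self, node):
        # Models acquiring an extra resource when demand grows.
        self.load.setdefault(node, 0)

    def schedule(self, file_name):
        """Return (node, cache_hit) for a task that reads file_name."""
        holders = {n for n in self.cache[file_name] if n in self.load}
        # Assumption: any holder beats any non-holder, regardless of load.
        candidates = holders if holders else set(self.load)
        # sorted() makes tie-breaking deterministic (lowest name wins).
        node = min(sorted(candidates), key=lambda n: self.load[n])
        hit = node in holders
        self.cache[file_name].add(node)     # a miss leaves a new replica behind
        self.load[node] += 1
        return node, hit


if __name__ == "__main__":
    sched = DataAwareScheduler(["n1", "n2"])
    for name in ["a", "a", "b", "b"]:
        print(name, sched.schedule(name))
```

In this sketch the first request for a file is a miss that places a replica on the least-loaded node, and later requests for the same file are routed to that replica. A more realistic policy would also weigh the holder's load against the cost of fetching the data elsewhere.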
“…Data diffusion involves a combination of dynamic resource provisioning, data caching, and data-aware scheduling. The approach is reminiscent of cooperative caching [27], cooperative web-caching [28], and peer-to-peer storage systems [29]. Other data-aware scheduling approaches tend to assume static resources [30,31], in which a system configuration dedicates nodes with roles (i.e.…”
Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.