Deduplication is a storage-saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files; for example, multiple versions of a file might exist that are mostly identical. Deduplication removes this redundancy: by storing data just once, all files that use identical regions refer to the same unique data. The most common approach splits file data into chunks and calculates a cryptographic fingerprint for each chunk. By checking whether the fingerprint has already been stored, a chunk is classified as redundant or unique, and only unique chunks are stored. This paper presents the first study on the potential of data deduplication in HPC centers, which are among the most demanding storage producers. We have quantitatively assessed this potential for capacity reduction for 4 data centers (BSC, DKRZ, RENCI, RWTH). In contrast to previous deduplication studies, which focus mostly on backup data, we have analyzed over one PB (1212 TB) of online file system data. The evaluation shows that typically 20% to 30% of this online data can be removed by applying data deduplication techniques, peaking at up to 70% for some data sets. This reduction can only be achieved by a sub-file deduplication approach; approaches based on whole-file comparisons lead to only small capacity savings.
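To make the chunk-fingerprinting idea concrete, here is a minimal Python sketch of fixed-size chunking with SHA-256 fingerprints. The chunk size, hash choice, and `deduplicate` helper are illustrative assumptions, not the parameters used in the study; production systems often use content-defined chunking instead of fixed-size chunks.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size; real systems often use content-defined chunking


def deduplicate(paths):
    """Count total vs. unique bytes across files using chunk fingerprints."""
    seen = set()           # fingerprints of chunks already stored
    total = unique = 0
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                total += len(chunk)
                fp = hashlib.sha256(chunk).digest()  # cryptographic fingerprint
                if fp not in seen:                   # unseen fingerprint -> unique chunk
                    seen.add(fp)
                    unique += len(chunk)
    return total, unique
```

In this sketch, the achievable capacity saving is 1 - unique/total: every chunk whose fingerprint has been seen before is classified as redundant and not stored again.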
In current supercomputers, storage is typically provided by parallel distributed file systems for hot data and tape archives for cold data. These file systems are often compatible with local file systems due to their use of the POSIX interface and semantics, which eases development and debugging because applications can run both on workstations and supercomputers. There is a wide variety of file systems to choose from, each tuned for different use cases and implementing different optimizations. However, overall application performance is often held back by I/O bottlenecks due to insufficient performance of file systems or I/O libraries for highly parallel workloads. Performance problems are addressed using novel storage hardware technologies as well as alternative I/O semantics and interfaces. These approaches have to be integrated into the storage stack seamlessly to make them convenient to use. Upcoming storage systems abandon the traditional POSIX interface and semantics in favor of alternative concepts such as object and key-value storage; moreover, they rely heavily on technologies such as NVM and burst buffers to improve performance. Additional tiers of storage hardware will increase the importance of hierarchical storage management. Many of these changes will be disruptive and require application developers to rethink their approaches to data management and I/O. A thorough understanding of today's storage infrastructures, including their strengths and weaknesses, is crucially important for designing and implementing scalable storage systems suitable for the demands of exascale computing.
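To illustrate the interface shift described above, the sketch below contrasts POSIX byte-stream I/O with a flat key-value put/get interface. The `KVStore` class is a hypothetical in-memory stand-in, not the API of any particular object store.

```python
import os


# POSIX-style: hierarchical path, byte-stream semantics, explicit durability
def posix_put(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the write to stable storage


# Key-value-style: flat namespace, whole-object put/get, no byte-level updates
class KVStore:
    def __init__(self):
        self._objects = {}            # stand-in for a distributed object store

    def put(self, key: str, value: bytes) -> None:
        self._objects[key] = value    # no seek, no partial writes, no rename

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

Dropping byte-granular updates and directory hierarchies is what lets key-value and object stores relax POSIX semantics and scale out more easily, which is also why adopting them requires applications to rethink their I/O patterns.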
The different rates of increase for computational power and storage capabilities of supercomputers turn data storage into a technical and economical problem. As storage capabilities lag behind, investments and operational costs for storage systems have increased to keep up with the supercomputers' I/O requirements. One promising approach is to reduce the amount of data that is stored. In this paper, we take a look at the impact of compression on the performance and costs of high performance systems. To this end, we analyze the applicability of compression on all layers of the I/O stack, that is, main memory, network and storage. Based on the Mistral system of the German Climate Computing Center (Deutsches Klimarechenzentrum, DKRZ), we illustrate potential performance improvements and cost savings. Making use of compression on a large scale can decrease investments and operational costs by 50% without negative impact on performance. Additionally, we present ongoing work for supporting enhanced adaptive compression in the parallel distributed file system Lustre and application-specific compression. Note that the computational speed figures are based on the TOP500 list, while the storage capacity and speed figures are based on single devices. We explicitly define the compression ratio as the fraction of uncompressed size over compressed size, that is, compression ratio = uncompressed size / compressed size. Due to its convenience, the inverse compression ratio (1 divided by the compression ratio) will also be used at some points in the paper; it indicates the fraction to which data can be compressed.
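As a worked example of both definitions, the following sketch computes the compression ratio and its inverse using zlib; the codec and compression level are illustrative assumptions, not choices tied to the paper's Lustre prototype.

```python
import zlib


def compression_ratios(data: bytes):
    """Return (compression ratio, inverse ratio) for a zlib-compressed buffer."""
    compressed = zlib.compress(data, level=6)
    ratio = len(data) / len(compressed)      # compression ratio = uncompressed / compressed
    inverse = len(compressed) / len(data)    # inverse ratio: fraction data shrinks to
    return ratio, inverse


# Example: highly redundant data compresses well
data = b"temperature=273.15;" * 10_000
ratio, inverse = compression_ratios(data)
print(f"ratio = {ratio:.1f}:1, data shrinks to {inverse:.1%} of its original size")
```

A compression ratio of 4, for instance, means the inverse ratio is 0.25: the data shrinks to a quarter of its original size.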
This paper explains why collaboration is a cornerstone of so many successful Intelligent Energy (IE) programs, and how organisations can use what has been learnt about collaboration to support their IE activities, whether they have a mature program or are just starting their journey. The paper will look at evidence for the importance of collaboration and why it is so frequently seen as a key element of transformational IE programs. The results from 24 IE assessments across different companies and assets point to collaboration as the most commonly recommended opportunity area for inclusion within IE initiatives. We will review the current state of collaboration to identify the value that has been delivered, along with common principles and lessons that can be extracted from multiple implementations. We will then consider future directions for collaboration. A Collaboration Maturity Model and Roadmap will be introduced to explain the current state and potential future developments of collaboration across the industry. Although collaboration has delivered much success to date, we believe that significantly more could be achieved through further technical, visualisation, process and organisational innovation. The model will be used to help illustrate and explain potential future developments, and to consider how organisations at all stages of maturity can increase the effectiveness of their collaboration activities. Collaboration has been one of the key successes of Intelligent Energy; however, as an industry we are still in the early stages of the journey of where collaboration and IE could take us. This paper charts our progress on that journey and sets out how we can use today's knowledge to accelerate and direct further developments to transform our business in the future.