2019
DOI: 10.1109/access.2019.2926195

MII: A Novel Content Defined Chunking Algorithm for Finding Incremental Data in Data Synchronization

Abstract: In data backup systems, incremental backup technology is emerging as a focus of academic and industrial research because it reduces the bandwidth and processing-time overhead that full backups incur when synchronizing backups with source data. Finding the incremental data between the backups and the source data is key to incremental backup, yet it remains poorly solved. To find the incremental data during the backup process, in this paper we propose a novel content-defined chunking …

Cited by 10 publications (4 citation statements)
References: 48 publications
“…Low entropy strings are strings which consist of repetitive bytes or patterns. This challenge means it is preferable for the algorithm to be able to eliminate the redundancy within this kind of string [32]. 4) High throughput [33].…”
Section: B. Motivation (mentioning)
confidence: 99%
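A toy sketch may make the low-entropy case concrete: when the input is a long run of one repeated pattern, a chunker that cuts it into equal pieces keeps emitting identical chunks, and a fingerprint index then stores that content only once. The pattern, chunk size, and helper below are illustrative and are not taken from the cited papers.

import hashlib

def dedupe(chunks):
    # Keep one physical copy per distinct chunk, keyed by its SHA-256 digest.
    store = {}
    for c in chunks:
        store.setdefault(hashlib.sha256(c).hexdigest(), c)
    return store

# A low-entropy string: one 16-byte pattern repeated 4096 times (64 KiB).
low_entropy = b"ABCDABCDABCDABCD" * 4096

# Splitting it into 4 KiB pieces yields 16 chunks with identical content,
# so the store retains a single chunk.
chunks = [low_entropy[i:i + 4096] for i in range(0, len(low_entropy), 4096)]
print(len(chunks), "chunks,", len(dedupe(chunks)), "unique")   # 16 chunks, 1 unique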
“…In the previous research of our team, MII algorithm was proposed to achieve better ability of resistance against the byte shifting by sacrificing the stability of chunk size [32]. The pseudo code and chunking process of MII algorithm are shown in Fig.…”
Section: B. Motivation (mentioning)
confidence: 99%
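The pseudo code and figure referenced in that statement are not reproduced on this page, so what follows is only a minimal sketch of a hash-free, MII-style chunker under one stated assumption: a cut point is declared once a run of strictly increasing byte values reaches a threshold w. The threshold value and function name are illustrative; the authors' pseudo code in [32] remains the authoritative description.

def mii_like_chunks(data: bytes, w: int = 5):
    # Assumed rule: declare a boundary after w consecutive byte-to-byte
    # increases. Boundaries depend only on local byte values, so an inserted
    # or deleted byte perturbs nearby chunks and the chunk stream
    # resynchronizes afterwards (byte-shift resistance), but nothing bounds
    # the chunk size, i.e. size stability is sacrificed.
    chunks, start, rising = [], 0, 0
    for i in range(1, len(data)):
        rising = rising + 1 if data[i] > data[i - 1] else 0
        if rising == w:
            chunks.append(data[start:i + 1])
            start, rising = i + 1, 0
    if start < len(data):                # trailing bytes form the last chunk
        chunks.append(data[start:])
    return chunks

# After a one-byte insertion, many chunk contents are still produced,
# which is what byte-shift resistance buys.
original = bytes(range(10, 60)) * 3
shifted = b"\x00" + original
shared = set(mii_like_chunks(original)) & set(mii_like_chunks(shifted))
print(len(shared), "chunk contents shared between the two versions")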
“…In the same way, the block level searches the content of the block, eliminates one copy of the duplicate, and retains the other block. Block-level deduplication of a file involves four processing steps: chunking, fingerprinting, indexing of fingerprints, and managing the stored information of data [5,6].…”
Section: Introduction (mentioning)
confidence: 99%
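As a rough illustration of those four steps, the sketch below strings together chunking, fingerprinting, fingerprint indexing, and management of the stored blocks. The fixed 4 KiB chunk size and the SHA-256 fingerprint are common illustrative choices, not details taken from the cited works.

import hashlib

CHUNK_SIZE = 4096   # illustrative fixed block size (4 KiB)

def chunk(data: bytes):
    # Step 1: chunking - split the content into blocks.
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def fingerprint(block: bytes) -> str:
    # Step 2: fingerprinting - derive a (practically) unique ID per block.
    return hashlib.sha256(block).hexdigest()

def deduplicate(data: bytes, index: dict, store: list):
    # Steps 3-4: look each fingerprint up in the index; store a block only
    # when its fingerprint is new, and return the file as a recipe of
    # fingerprints so it can be reconstructed from the managed store.
    recipe = []
    for block in chunk(data):
        fp = fingerprint(block)
        if fp not in index:
            index[fp] = len(store)    # index maps fingerprint -> storage slot
            store.append(block)       # retain exactly one copy
        recipe.append(fp)
    return recipe

index, store = {}, []
data = b"backup payload " * 20000
recipe = deduplicate(data, index, store)
print(len(recipe), "blocks referenced,", len(store), "physically stored")
assert b"".join(store[index[fp]] for fp in recipe) == data   # reconstruction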
“…In fixed-length block deduplication, data is divided into chunks of a constant size, whereas in variable-length block deduplication, data is divided into distinct chunks based on different factors [9,12]. While block-level deduplication is more efficient than file-level deduplication, it requires more system resources [13,14].…”
Section: Introduction (mentioning)
confidence: 99%
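To make the fixed-versus-variable distinction concrete, here is a small sketch that contrasts fixed-size splitting with a simple content-defined splitter. The mask, size limits, and the toy hash are illustrative assumptions rather than values from the cited papers; production systems typically use Rabin or Gear fingerprints instead.

import random

def fixed_chunks(data: bytes, size: int = 4096):
    # Fixed-length blocks: cheap to compute, but an insertion near the start
    # shifts every later boundary, so downstream blocks stop matching.
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, mask: int = 0x0FFF,
                    min_size: int = 1024, max_size: int = 16384):
    # Variable-length blocks: declare a boundary where a content-dependent
    # hash hits a mask (expected size around mask + 1 bytes), bounded by
    # min_size/max_size. Boundaries follow the content, so they survive
    # byte shifting, at the cost of more CPU and index bookkeeping.
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # toy stand-in for Rabin/Gear hashing
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Chunks untouched by an early edit are typically reproduced, because the
# boundaries depend on local content rather than absolute offsets.
random.seed(0)
doc = bytes(random.getrandbits(8) for _ in range(1 << 16))   # 64 KiB sample
edited = doc[:100] + b"INSERTED CONTENT" + doc[100:]
kept = set(variable_chunks(doc)) & set(variable_chunks(edited))
print(len(kept), "of", len(variable_chunks(doc)), "variable-size chunks survive the edit")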