The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression

The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression. Xia, W., Pu, L., Zou, X., Shilane, P., Li, S., Zhang, H., & Wang, X. ACM Trans. Storage, Association for Computing Machinery, New York, NY, USA, jun, 2023.

Paper doi abstract bibtex

Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values. In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead).Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4× and ∼7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20× and 1.41× higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22× higher compression ratio over Finesse.

@article{10.1145/3584663,
author = {Xia, Wen and Pu, Lifeng and Zou, Xiangyu and Shilane, Philip and Li, Shiyi and Zhang, Haijun and Wang, Xuan},
title = {The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression},
year = {2023},
issue_date = {August 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {3},
issn = {1553-3077},
url = {https://doi.org/10.1145/3584663},
doi = {10.1145/3584663},
abstract = {Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values.&nbsp;&nbsp;In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data&nbsp;[1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead).Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4\texttimes{} and ∼7.9\texttimes{} faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform&nbsp;[39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20\texttimes{} and 1.41\texttimes{} higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22\texttimes{} higher compression ratio over Finesse.},
journal = {ACM Trans. Storage},
month = {jun},
articleno = {22},
numpages = {30},
keywords = {content-defined sampling, parallel rolling hash, SIMD, resemblance detection, Post-deduplication delta compression}
}

Downloads: 0

{"_id":"769qHrqBmRPNxocYY","bibbaseid":"xia-pu-zou-shilane-li-zhang-wang-thedesignoffastandlightweightresemblancedetectionforefficientpostdeduplicationdeltacompression-2023","author_short":["Xia, W.","Pu, L.","Zou, X.","Shilane, P.","Li, S.","Zhang, H.","Wang, X."],"bibdata":{"bibtype":"article","type":"article","author":[{"propositions":[],"lastnames":["Xia"],"firstnames":["Wen"],"suffixes":[]},{"propositions":[],"lastnames":["Pu"],"firstnames":["Lifeng"],"suffixes":[]},{"propositions":[],"lastnames":["Zou"],"firstnames":["Xiangyu"],"suffixes":[]},{"propositions":[],"lastnames":["Shilane"],"firstnames":["Philip"],"suffixes":[]},{"propositions":[],"lastnames":["Li"],"firstnames":["Shiyi"],"suffixes":[]},{"propositions":[],"lastnames":["Zhang"],"firstnames":["Haijun"],"suffixes":[]},{"propositions":[],"lastnames":["Wang"],"firstnames":["Xuan"],"suffixes":[]}],"title":"The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression","year":"2023","issue_date":"August 2023","publisher":"Association for Computing Machinery","address":"New York, NY, USA","volume":"19","number":"3","issn":"1553-3077","url":"https://doi.org/10.1145/3584663","doi":"10.1145/3584663","abstract":"Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values.  In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead).Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4× and ∼7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20× and 1.41× higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22× higher compression ratio over Finesse.","journal":"ACM Trans. Storage","month":"jun","articleno":"22","numpages":"30","keywords":"content-defined sampling, parallel rolling hash, SIMD, resemblance detection, Post-deduplication delta compression","bibtex":"@article{10.1145/3584663,\nauthor = {Xia, Wen and Pu, Lifeng and Zou, Xiangyu and Shilane, Philip and Li, Shiyi and Zhang, Haijun and Wang, Xuan},\ntitle = {The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression},\nyear = {2023},\nissue_date = {August 2023},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nvolume = {19},\nnumber = {3},\nissn = {1553-3077},\nurl = {https://doi.org/10.1145/3584663},\ndoi = {10.1145/3584663},\nabstract = {Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values.  In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead).Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4\\texttimes{} and ∼7.9\\texttimes{} faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20\\texttimes{} and 1.41\\texttimes{} higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22\\texttimes{} higher compression ratio over Finesse.},\njournal = {ACM Trans. Storage},\nmonth = {jun},\narticleno = {22},\nnumpages = {30},\nkeywords = {content-defined sampling, parallel rolling hash, SIMD, resemblance detection, Post-deduplication delta compression}\n}\n\n","author_short":["Xia, W.","Pu, L.","Zou, X.","Shilane, P.","Li, S.","Zhang, H.","Wang, X."],"key":"10.1145/3584663","id":"10.1145/3584663","bibbaseid":"xia-pu-zou-shilane-li-zhang-wang-thedesignoffastandlightweightresemblancedetectionforefficientpostdeduplicationdeltacompression-2023","role":"author","urls":{"Paper":"https://doi.org/10.1145/3584663"},"keyword":["content-defined sampling","parallel rolling hash","SIMD","resemblance detection","Post-deduplication delta compression"],"metadata":{"authorlinks":{}},"downloads":0,"html":""},"bibtype":"article","biburl":"https://bibbase.org/network/files/QDBhwkwh8hDye6xL8","dataSources":["4x87iNQYFcnH8n6Cm"],"keywords":["content-defined sampling","parallel rolling hash","simd","resemblance detection","post-deduplication delta compression"],"search_terms":["design","fast","lightweight","resemblance","detection","efficient","post","deduplication","delta","compression","xia","pu","zou","shilane","li","zhang","wang"],"title":"The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression","year":2023}