Redundancy elimination within large collections of files

Redundancy elimination within large collections of files. Kulkarni, P., Douglis, F., Lavoie, J., & Tracey, J. M. 2004.

Ongoing advancements in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL generally encodes more compactly than compression (up to a factor of 14) and a combination of compression and duplicate suppression (up to a factor of 6.7). REBL also encodes similarly to a technique based on delta-encoding, reducing overall space significantly in one case. Furthermore, REBL uses super-fingerprints, a technique that reduces the data needed to identify similar blocks while dramatically reducing the computational requirements of matching the blocks: it turns O(n^2) comparisons into hash table lookups. As a result, using super-fingerprints to avoid enumerating matching data objects decreases computation in the resemblance detection phase of REBL by up to a couple of orders of magnitude.
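
The abstract's remark that super-fingerprints turn pairwise comparisons into hash table lookups can be illustrated with a minimal sketch. This is not the REBL implementation: the salted BLAKE2b hash stands in for whatever fingerprint function the paper uses, and the function names, window size, feature count, and group size are all assumptions chosen for illustration. Each block yields a few min-hash-style features, groups of features are hashed into super-fingerprints, and blocks that share a super-fingerprint are reported as candidates without comparing every pair.

```python
import hashlib
from collections import defaultdict


def _h(data: bytes, salt: int) -> int:
    # Salted 64-bit hash (assumption: a stand-in for the paper's fingerprint function).
    digest = hashlib.blake2b(
        data, digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def features(block: bytes, num_features: int = 8, window: int = 8) -> list:
    # Feature i is the minimum of salted hash i over all sliding windows,
    # so blocks with mostly identical content tend to share features.
    windows = [block[j:j + window] for j in range(max(1, len(block) - window + 1))]
    return [min(_h(w, i) for w in windows) for i in range(num_features)]


def super_fingerprints(block: bytes, num_features: int = 8, group_size: int = 4) -> list:
    # Hash each group of features into one value; tag it with the group
    # position so that only corresponding groups can match.
    feats = features(block, num_features)
    sfps = []
    for g in range(0, num_features, group_size):
        payload = b"".join(f.to_bytes(8, "little") for f in feats[g:g + group_size])
        sfps.append((g // group_size, _h(payload, 0)))
    return sfps


def find_similar(blocks: dict) -> set:
    # One dictionary lookup per super-fingerprint replaces comparing
    # each block against every other block.
    index = defaultdict(list)
    pairs = set()
    for name, data in blocks.items():
        for sfp in super_fingerprints(data):
            for other in index[sfp]:
                pairs.add((other, name))
            index[sfp].append(name)
    return pairs
```

Calling `find_similar` on a mapping from hypothetical block names to byte strings returns candidate pairs in insertion order. In a full pipeline those candidates would then be delta-encoded against each other, a step this sketch leaves out.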

@conference {1247420,
	title = {Redundancy elimination within large collections of files},
	booktitle = {ATEC {\textquoteright}04: Proceedings of the annual conference on USENIX Annual Technical Conference},
	year = {2004},
	pages = {5{\textendash}5},
	publisher = {USENIX Association},
	organization = {USENIX Association},
	address = {Berkeley, CA, USA},
	abstract = {Ongoing advancements in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL generally encodes more compactly than compression (up to a factor of 14) and a combination of compression and duplicate suppression (up to a factor of 6.7). REBL also encodes similarly to a technique based on delta-encoding, reducing overall space significantly in one case. Furthermore, REBL uses super-fingerprints, a technique that reduces the data needed to identify similar blocks while dramatically reducing the computational requirements of matching the blocks: it turns O($n^2$) comparisons into hash table lookups. As a result, using super-fingerprints to avoid enumerating matching data objects decreases computation in the resemblance detection phase of REBL by up to a couple of orders of magnitude.
},
	url = {http://portal.acm.org/citation.cfm?id=1247420$\#$},
	author = {Kulkarni, Purushottam and Douglis, Fred and Lavoie, Jason and Tracey, John M.}
}
