Web-scale Content Reuse Detection (extended). Ardi, C. & Heidemann, J. Technical Report ISI-TR-2014-692, USC/Information Sciences Institute, June 2014.

Abstract: With the vast amount of accessible, online content, it is not surprising that unscrupulous entities "borrow" from the web to provide filler for advertisements, link farms, and spam and make a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically discover previously unknown duplicate content in the web, and the second to detect copies of discovered or manually identified content in the web. Our detection can also identify bad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments with two large datasets: a Common Crawl subset of the web, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Utilizing an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% for commercial benefit.
@TechReport{Ardi14a,
author = "Calvin Ardi and John Heidemann",
title = "Web-scale Content Reuse Detection (extended)",
institution = "USC/Information Sciences Institute",
year = 2014,
sortdate = "2014-06-01",
number = "ISI-TR-2014-692",
month = jun,
jlocation = "johnh: pafile",
keywords = "hashing, content reuse, wikipedia, copying",
url = "https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.html",
pdfurl = "https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.pdf",
otherurl = "ftp://ftp.isi.edu/isi-pubs/tr-692.pdf",
myorganization = "USC/Information Sciences Institute",
copyrightholder = "authors",
project = "ant, mega",
abstract = "
With the vast amount of accessible, online content, it is not
surprising that unscrupulous entities ``borrow'' from the web to
provide filler for advertisements, link farms, and spam and make a
quick profit. Our insight is that cryptographic hashing and
fingerprinting can efficiently identify content reuse for web-size
corpora. We develop two related algorithms, one to
automatically \emph{discover} previously unknown
duplicate content in the web, and
the second to \emph{detect} copies of discovered or manually
identified content in the web. Our detection can also identify \emph{bad
neighborhoods}, clusters of pages where copied content is frequent.
We verify our approach with controlled experiments with two large
datasets: a Common Crawl subset of the web, and a copy of Geocities, an
older set of user-provided web content. We then demonstrate that we
can discover otherwise unknown examples of duplication for spam, and
detect both discovered and expert-identified content in these large
datasets. Utilizing an original copy of Wikipedia as identified
content, we find 40 sites that reuse this content, 86\% for commercial
benefit.
",
}
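
The abstract describes the discover/detect pipeline only at a high level; the short Python sketch below illustrates the general hash-based idea under assumed choices (fixed-size word chunks, SHA-1 chunk hashes, and a simple occurrence threshold). All function names, parameters, and URLs here are illustrative assumptions, not the authors' implementation.

# Minimal sketch (assumption-laden) of hash-based content-reuse analysis:
# split each page into chunks, hash every chunk, then (1) "discover" chunk
# hashes that recur across many pages and (2) "detect" pages containing
# chunks of already-known content. Chunk size and all names are hypothetical.

import hashlib
from collections import defaultdict

def chunk_page(text, words_per_chunk=64):
    """Split a page into fixed-size word chunks (an assumed granularity)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def hash_chunk(chunk):
    """Cryptographic hash of a lowercased chunk; collisions are negligible."""
    return hashlib.sha1(chunk.lower().encode("utf-8")).hexdigest()

def discover_duplicates(corpus, min_pages=3):
    """Discovery: return chunk hashes seen on at least min_pages distinct pages."""
    pages_with = defaultdict(set)                  # chunk hash -> page URLs
    for url, text in corpus.items():
        for chunk in chunk_page(text):
            pages_with[hash_chunk(chunk)].add(url)
    return {h: urls for h, urls in pages_with.items() if len(urls) >= min_pages}

def detect_copies(corpus, known_hashes):
    """Detection: map each page to the known-content chunk hashes it contains."""
    hits = defaultdict(set)
    for url, text in corpus.items():
        for chunk in chunk_page(text):
            h = hash_chunk(chunk)
            if h in known_hashes:
                hits[url].add(h)
    return dict(hits)

if __name__ == "__main__":
    corpus = {                                     # toy stand-in for a crawl
        "http://example.com/a": "lorem ipsum dolor sit amet " * 50,
        "http://example.com/b": "lorem ipsum dolor sit amet " * 50,
        "http://example.com/c": "an entirely original page with unique text",
    }
    shared = discover_duplicates(corpus, min_pages=2)
    print(len(shared), "chunk hashes shared by 2+ pages")
    print(detect_copies(corpus, set(shared)))      # pages reusing that content

Because the chunk keys are cryptographic hashes, distinct chunks effectively never collide, so discovery reduces to counting hash occurrences and detection to set membership; both stay cheap at scale, which is the efficiency argument the abstract makes for web-size corpora.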
{"_id":"JxECpPyd5w7bjQgoG","bibbaseid":"ardi-heidemann-webscalecontentreusedetectionextended-2014","author_short":["Ardi, C.","Heidemann, J."],"bibdata":{"bibtype":"techreport","type":"techreport","author":[{"firstnames":["Calvin"],"propositions":[],"lastnames":["Ardi"],"suffixes":[]},{"firstnames":["John"],"propositions":[],"lastnames":["Heidemann"],"suffixes":[]}],"title":"Web-scale Content Reuse Detection (extended)","institution":"USC/Information Sciences Institute","year":"2014","sortdate":"2014-06-01","number":"ISI-TR-2014-692","month":"June","jlocation":"johnh: pafile","keywords":"hashing, content reuse, wikipedia, copying","url":"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.html","pdfurl":"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.pdf","otherurl":"ftp://ftp.isi.edu/isi-pubs/tr-692.pdf","myorganization":"USC/Information Sciences Institute","copyrightholder":"authors","project":"ant, mega","abstract":"With the vast amount of accessible, online content, it is not surprising that unscrupulous entities ``borrow'' from the web to provide filler for advertisements, link farms, and spam and make a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically \\emphdiscover previously unknown duplicate content in the web, and the second to \\emphdetect copies of discovered or manually identified content in the web. Our detection can also \\emphbad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments with two large datasets: a Common Crawl subset the web, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Utilizing an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% for commercial benefit. ","bibtex":"@TechReport{Ardi14a,\n\tauthor = \t\"Calvin Ardi and John Heidemann\",\n\ttitle = \"Web-scale Content Reuse Detection (extended)\",\n\tinstitution = \t\"USC/Information Sciences Institute\",\n\tyear = \t\t2014,\n\tsortdate = \t\t\"2014-06-01\",\n\tnumber =\t\"ISI-TR-2014-692\",\n\tmonth =\t\tjun,\n\tjlocation =\t\"johnh: pafile\",\n\tkeywords =\t\"hashing, content reuse, wikipedia, copying\",\n\turl =\t\t\"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.html\",\n\tpdfurl =\t\"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.pdf\",\n\totherurl = \"ftp://ftp.isi.edu/isi-pubs/tr-692.pdf\",\n\tmyorganization =\t\"USC/Information Sciences Institute\",\n\tcopyrightholder = \"authors\",\n\tproject = \"ant, mega\",\n\tabstract = \"\nWith the vast amount of accessible, online content, it is not\nsurprising that unscrupulous entities ``borrow'' from the web to\nprovide filler for advertisements, link farms, and spam and make a\nquick profit. Our insight is that cryptographic hashing and\nfingerprinting can efficiently identify content reuse for web-size\ncorpora. We develop two related algorithms, one to \nautomatically \\emph{discover} previously unknown \nduplicate content in the web, and\nthe second to \\emph{detect} copies of discovered or manually\nidentified content in the web. 
Our detection can also \\emph{bad\nneighborhoods}, clusters of pages where copied content is frequent.\nWe verify our approach with controlled experiments with two large\ndatasets: a Common Crawl subset the web, and a copy of Geocities, an\nolder set of user-provided web content. We then demonstrate that we\ncan discover otherwise unknown examples of duplication for spam, and\ndetect both discovered and expert-identified content in these large\ndatasets. Utilizing an original copy of Wikipedia as identified\ncontent, we find 40 sites that reuse this content, 86\\% for commercial\nbenefit.\n\",\n}\n\n\n","author_short":["Ardi, C.","Heidemann, J."],"bibbaseid":"ardi-heidemann-webscalecontentreusedetectionextended-2014","role":"author","urls":{"Paper":"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.html"},"keyword":["hashing","content reuse","wikipedia","copying"],"metadata":{"authorlinks":{}}},"bibtype":"techreport","biburl":"https://bibbase.org/f/dHevizJoWEhWowz8q/johnh-2023-2.bib","dataSources":["YLyu3mj3xsBeoqiHK","fLZcDgNSoSuatv6aX","fxEParwu2ZfurScPY","7nuQvtHTqKrLmgu99"],"keywords":["hashing","content reuse","wikipedia","copying"],"search_terms":["web","scale","content","reuse","detection","extended","ardi","heidemann"],"title":"Web-scale Content Reuse Detection (extended)","year":2014}