Web-scale Content Reuse Detection (extended). Ardi, C. & Heidemann, J. Technical Report ISI-TR-2014-692, USC/Information Sciences Institute, June, 2014.
With the vast amount of accessible, online content, it is not surprising that unscrupulous entities "borrow" from the web to provide filler for advertisements, link farms, and spam and make a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically discover previously unknown duplicate content in the web, and the second to detect copies of discovered or manually identified content in the web. Our detection can also identify bad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments with two large datasets: a Common Crawl subset of the web, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Utilizing an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% for commercial benefit.
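The abstract's core idea is that cryptographic hashing of content chunks can both discover previously unknown duplication and detect copies of known content at web scale. The following is a minimal illustrative sketch of that general hash-grouping technique, not the paper's actual implementation; the function names, chunking, and data layout here are assumptions made for the example.

```python
# Illustrative sketch of hash-based duplicate discovery and detection.
# Assumes pages have already been crawled and split into content chunks;
# discover_duplicates() and detect_known() are hypothetical names, not from the paper.
import hashlib
from collections import defaultdict

def chunk_hash(chunk: bytes) -> str:
    """Cryptographic hash of one content chunk (SHA-1 used here for brevity)."""
    return hashlib.sha1(chunk).hexdigest()

def discover_duplicates(pages: dict[str, list[bytes]]) -> dict[str, set[str]]:
    """Group pages by chunk hash; a hash appearing on multiple pages is a reuse candidate."""
    seen: dict[str, set[str]] = defaultdict(set)   # hash -> URLs containing that chunk
    for url, chunks in pages.items():
        for chunk in chunks:
            seen[chunk_hash(chunk)].add(url)
    return {h: urls for h, urls in seen.items() if len(urls) > 1}

def detect_known(pages: dict[str, list[bytes]], reference: list[bytes]) -> dict[str, list[str]]:
    """Flag pages whose chunks match a known corpus (e.g., expert-identified content)."""
    known = {chunk_hash(c) for c in reference}
    hits: dict[str, list[str]] = defaultdict(list)
    for url, chunks in pages.items():
        for chunk in chunks:
            h = chunk_hash(chunk)
            if h in known:
                hits[url].append(h)
    return dict(hits)
```

In this sketch, discovery is a single pass that builds a hash-to-URLs index, and detection is a membership test against a precomputed set of reference hashes; pages or sites with many hits would correspond to the "bad neighborhoods" the abstract describes.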
@TechReport{Ardi14a,
	author = 	"Calvin Ardi and John Heidemann",
	title = "Web-scale Content Reuse Detection (extended)",
	institution = 	"USC/Information Sciences Institute",
	year = 		2014,
	sortdate = 		"2014-06-01",
	number =	"ISI-TR-2014-692",
	month =		jun,
	jlocation =	"johnh: pafile",
	keywords =	"hashing, content reuse, wikipedia, copying",
	url =		"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.html",
	pdfurl =	"https://ant.isi.edu/%7ejohnh/PAPERS/Ardi14a.pdf",
	otherurl = "ftp://ftp.isi.edu/isi-pubs/tr-692.pdf",
	myorganization =	"USC/Information Sciences Institute",
	copyrightholder = "authors",
	project = "ant, mega",
	abstract = "
With the vast amount of accessible, online content, it is not
surprising that unscrupulous entities ``borrow'' from the web to
provide filler for advertisements, link farms, and spam and make a
quick profit.  Our insight is that cryptographic hashing and
fingerprinting can efficiently identify content reuse for web-size
corpora.  We develop two related algorithms, one to 
automatically \emph{discover} previously unknown 
duplicate content in the web, and
the second to \emph{detect} copies of discovered or manually
identified content in the web.  Our detection can also identify \emph{bad
neighborhoods}, clusters of pages where copied content is frequent.
We verify our approach with controlled experiments with two large
datasets: a Common Crawl subset of the web, and a copy of Geocities, an
older set of user-provided web content.  We then demonstrate that we
can discover otherwise unknown examples of duplication for spam, and
detect both discovered and expert-identified content in these large
datasets.  Utilizing an original copy of Wikipedia as identified
content, we find 40 sites that reuse this content, 86\% for commercial
benefit.
",
}
