Precise Detection of Content Reuse in the Web. Ardi, C. & Heidemann, J. ACM SIGCOMM Computer Communication Review, 49(2):9–24, ACM, New York, NY, USA, April 2019.
With the vast amount of content online, it is not surprising that unscrupulous entities "borrow" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms: one to automatically *discover* previously unknown duplicate content in the web, and a second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithms and their choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% of it is commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without addressing intentional cloaking.
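To make the hashing insight concrete, here is a minimal sketch of the discovery idea (not the authors' implementation; the blank-line chunking, SHA-256, and the in-memory index are illustrative assumptions): split each page into chunks, hash each chunk with a cryptographic hash, and flag any digest that appears on more than one page as reused content.

import hashlib
from collections import defaultdict

def chunk_page(text, delimiter="\n\n"):
    # Illustrative chunking: split on blank lines; the paper's actual
    # chunking policy may differ.
    return [c.strip() for c in text.split(delimiter) if c.strip()]

def discover_duplicates(pages):
    # pages: dict mapping URL -> page text.
    # Returns digest -> set of URLs for every chunk seen on 2+ pages.
    index = defaultdict(set)
    for url, text in pages.items():
        for chunk in chunk_page(text):
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            index[digest].add(url)
    return {h: urls for h, urls in index.items() if len(urls) > 1}

# Toy usage: two pages that share one paragraph.
pages = {
    "http://example.com/a": "Unique intro.\n\nShared paragraph of text.",
    "http://example.org/b": "Different intro.\n\nShared paragraph of text.",
}
for digest, urls in discover_duplicates(pages).items():
    print(digest[:12], sorted(urls))

Exact matching on the digest is what separates this approach from locality-sensitive hashing: two chunks map to the same digest only if their bytes are identical, which is why the false positives mentioned in the abstract are avoided.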
@article{Ardi19a,
	author          = {Ardi, Calvin and Heidemann, John},
	title           = {Precise Detection of Content Reuse in the Web},
	journal         = "ACM SIGCOMM Computer Communication Review",
	project         = "ant, mega",
	sortdate        = "2019-05-22",
	issue_date      = {April 2019},
	volume          = {49},
	number          = {2},
	month           = apr,
	year            = {2019},
	issn            = {0146-4833},
	pages           = {9--24},
	numpages        = {16},
	url             = "http://doi.acm.org/10.1145/3336937.3336940",
	pdfurl          = "https://ccronline.sigcomm.org/wp-content/uploads/2019/05/acmdl19-299.pdf",
	blogurl         = "https://ant.isi.edu/blog/?p=1311",
	doi             = "10.1145/3336937.3336940",
	acmid           = {3336940},
	publisher       = "ACM",
	address         = {New York, NY, USA},
	keywords        = {content duplication, content reuse, duplicate detection, phishing},
	institution     = "USC/Information Sciences Institute",
	myorganization  = "USC/Information Sciences Institute",
	copyrightholder = "authors",
	abstract        = {
With the vast amount of content online, it is not surprising that unscrupulous
entities "borrow" from the web to provide content for advertisements, link
farms, and spam. Our insight is that cryptographic hashing and fingerprinting
can efficiently identify content reuse for web-size corpora. We develop two
related algorithms: one to automatically *discover* previously unknown
duplicate content in the web, and a second to *precisely detect* copies of
discovered or manually identified content. We show that *bad neighborhoods*,
clusters of pages where copied content is frequent, help identify copying in
the web. We verify our algorithms and their choices with controlled experiments
over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a
phishing corpus (2014). We show that our use of cryptographic hashing is much
more precise than alternatives such as locality-sensitive hashing, avoiding the
thousands of false positives that would otherwise occur. We apply our approach
in three systems: discovering and detecting duplicated content in the web,
searching explicitly for copies of Wikipedia in the web, and detecting phishing
sites in a web browser. We show that general copying in the web is often benign
(for example, templates), but 6–11% of it is commercial or possibly commercial.
Most copies of Wikipedia (86%) are commercialized (link farming or
advertisements). For phishing, we focus on PayPal, detecting 59% of
PayPal-phish even without addressing intentional cloaking.
},
}
