Precise Detection of Content Reuse in the Web. Ardi, C. & Heidemann, J. ACM SIGCOMM Computer Communication Review, 49(2):9–24, ACM, New York, NY, USA, April 2019.
With the vast amount of content online, it is not surprising that unscrupulous entities "borrow" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms: one to automatically *discover* previously unknown duplicate content in the web, and a second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithms and their choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% of it is commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without addressing intentional cloaking.
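To make the hashing insight concrete, here is a minimal sketch of the discovery idea (not the authors' implementation; the blank-line chunking, SHA-256, and the in-memory index are illustrative assumptions): split each page into chunks, hash each chunk with a cryptographic hash, and flag any digest that appears on more than one page as reused content.

import hashlib
from collections import defaultdict

def chunk_page(text, delimiter="\n\n"):
    # Illustrative chunking: split on blank lines; the paper's actual
    # chunking policy may differ.
    return [c.strip() for c in text.split(delimiter) if c.strip()]

def discover_duplicates(pages):
    # pages: dict mapping URL -> page text.
    # Returns digest -> set of URLs for every chunk seen on 2+ pages.
    index = defaultdict(set)
    for url, text in pages.items():
        for chunk in chunk_page(text):
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            index[digest].add(url)
    return {h: urls for h, urls in index.items() if len(urls) > 1}

# Toy usage: two pages that share one paragraph.
pages = {
    "http://example.com/a": "Unique intro.\n\nShared paragraph of text.",
    "http://example.org/b": "Different intro.\n\nShared paragraph of text.",
}
for digest, urls in discover_duplicates(pages).items():
    print(digest[:12], sorted(urls))

Exact matching on the digest is what separates this approach from locality-sensitive hashing: two chunks map to the same digest only if their bytes are identical, which is why the false positives mentioned in the abstract are avoided.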
@article{Ardi19a,
	author          = {Ardi, Calvin and Heidemann, John},
	title           = {Precise Detection of Content Reuse in the Web},
	journal         = "ACM SIGCOMM Computer Communication Review",
	project         = "ant, mega",
	sortdate        = "2019-05-22",
	issue_date      = {April 2019},
	volume          = {49},
	number          = {2},
	month           = apr,
	year            = {2019},
	issn            = {0146-4833},
	pages           = {9--24},
	numpages        = {16},
	url             = "http://doi.acm.org/10.1145/3336937.3336940",
	pdfurl          = "https://ccronline.sigcomm.org/wp-content/uploads/2019/05/acmdl19-299.pdf",
	blogurl         = "https://ant.isi.edu/blog/?p=1311",
	doi             = "10.1145/3336937.3336940",
	acmid           = {3336940},
	publisher       = "ACM",
	address         = {New York, NY, USA},
	keywords        = {content duplication, content reuse, duplicate detection, phishing},
	institution     = "USC/Information Sciences Institute",
	myorganization  = "USC/Information Sciences Institute",
	copyrightholder = "authors",
	abstract        = {
With the vast amount of content online, it is not surprising that unscrupulous
entities "borrow" from the web to provide content for advertisements, link
farms, and spam. Our insight is that cryptographic hashing and fingerprinting
can efficiently identify content reuse for web-size corpora. We develop two
related algorithms: one to automatically *discover* previously unknown
duplicate content in the web, and a second to *precisely detect* copies of
discovered or manually identified content. We show that *bad neighborhoods*,
clusters of pages where copied content is frequent, help identify copying in
the web. We verify our algorithms and their choices with controlled experiments
over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a
phishing corpus (2014). We show that our use of cryptographic hashing is much
more precise than alternatives such as locality-sensitive hashing, avoiding the
thousands of false positives that would otherwise occur. We apply our approach
in three systems: discovering and detecting duplicated content in the web,
searching explicitly for copies of Wikipedia in the web, and detecting phishing
sites in a web browser. We show that general copying in the web is often benign
(for example, templates), but 6–11% of it is commercial or possibly commercial.
Most copies of Wikipedia (86%) are commercialized (link farming or
advertisements). For phishing, we focus on PayPal, detecting 59% of
PayPal-phish even without addressing intentional cloaking.
},
}
