Citation-based Plagiarism Detection: Practicability on a Large-Scale Scientific Corpus

Citation-based Plagiarism Detection: Practicability on a Large-Scale Scientific Corpus. Gipp, B., Meuschke, N., & Breitinger, C. Journal of the Association for Information Science and Technology, 65(8):1527–1540, August 2014. Venue Rating: SJR Q1


The automated detection of plagiarism is an information retrieval task of increasing importance as the volume of readily accessible information on the web expands. A major shortcoming of current automated plagiarism detection approaches is their dependence on high character-based similarity. As a result, heavily disguised plagiarism forms, such as paraphrases, translated plagiarism, or structural and idea plagiarism, remain undetected. A recently proposed language-independent approach to plagiarism detection, Citation-based Plagiarism Detection (CbPD), allows the detection of semantic similarity even in the absence of text overlap by analyzing the citation placement in a document's full text to determine similarity. This article evaluates the performance of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles. We benchmark CbPD against two character-based detection approaches using a ground truth approximated in a user study. Our evaluation shows that the citation-based approach achieves superior ranking performance for heavily disguised plagiarism forms. Additionally, we demonstrate CbPD to be computationally more efficient than character-based approaches. Finally, upon combining the citation-based with the traditional character-based document similarity visualization methods in a hybrid detection prototype, we observe a reduction in the required user effort for document verification.
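The abstract describes CbPD only at a high level: similarity is derived from where citations are placed in the full text rather than from overlapping characters. As a rough illustration of that idea, and not the authors' implementation, the sketch below compares two documents by the order of their in-text citations, using the longest common subsequence of citation identifiers. The function names, the normalization by the shorter sequence, and the sample citation keys are all assumptions made for this example.

```python
from typing import Sequence


def longest_common_citation_sequence(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two citation sequences.

    Each sequence lists citation identifiers in the order they appear
    in a document's full text (hypothetical input format).
    """
    # Classic dynamic-programming LCS, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def citation_order_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Shared citation order as a fraction of the shorter sequence.

    The normalization is an assumption for illustration only.
    """
    if not a or not b:
        return 0.0
    return longest_common_citation_sequence(a, b) / min(len(a), len(b))


# Made-up citation keys; no text is compared at any point.
doc_a = ["smith2001", "lee2005", "chen2010", "kumar2012"]
doc_b = ["lee2005", "jones1999", "chen2010", "kumar2012"]
print(citation_order_similarity(doc_a, doc_b))  # 0.75
```

Because only citation identifiers are compared, the score is unaffected by paraphrasing or translation of the surrounding text, which is the property the abstract attributes to the citation-based approach.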

@article{GippMB14,
	title = {Citation-based {Plagiarism} {Detection}: {Practicability} on a {Large}-{Scale} {Scientific} {Corpus}},
	volume = {65},
	issn = {2330-1635},
	url = {https://www.gipp.com/wp-content/papercite-data/pdf/gipp13b.pdf},
	doi = {10.1002/asi.23228},
	abstract = {The automated detection of plagiarism is an information retrieval task of increasing importance as the volume of readily accessible information on the web expands. A major shortcoming of current automated plagiarism detection approaches is their dependence on high character-based similarity. As a result, heavily disguised plagiarism forms, such as paraphrases, translated plagiarism, or structural and idea plagiarism, remain undetected. A recently proposed language-independent approach to plagiarism detection, Citation-based Plagiarism Detection (CbPD), allows the detection of semantic similarity even in the absence of text overlap by analyzing the citation placement in a document's full text to determine similarity. This article evaluates the performance of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles. We benchmark CbPD against two character-based detection approaches using a ground truth approximated in a user study. Our evaluation shows that the citation-based approach achieves superior ranking performance for heavily disguised plagiarism forms. Additionally, we demonstrate CbPD to be computationally more efficient than character-based approaches. Finally, upon combining the citation-based with the traditional character-based document similarity visualization methods in a hybrid detection prototype, we observe a reduction in the required user effort for document verification.},
	number = {8},
	journal = {Journal of the Association for Information Science and Technology},
	author = {Gipp, Bela and Meuschke, Norman and Breitinger, Corinna},
	month = aug,
	year = {2014},
	note = {Venue Rating: SJR Q1},
	keywords = {Plagiarism Detection},
	pages = {1527--1540},
}

