Detecting Cross-Language Plagiarism using Open Knowledge Graphs

Detecting Cross-Language Plagiarism using Open Knowledge Graphs. Stegmueller, J., Bauer-Marquart, F., Meuschke, N., Ruas, T., Schubotz, M., & Gipp, B. In 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2021) at the ACM/IEEE Joint Conference on Digital Libraries 2021 (JCDL2021), Virtual Event, September, 2021. ACM.


Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
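
The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the placeholder entity IDs (`E1`, `E2`, `E3`), the function names, and the use of cosine similarity over raw entity counts are all assumptions made for illustration. A real system would link text to knowledge-graph entities and disambiguate homonyms, which this sketch does not attempt.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon: surface forms in two languages mapped to
# placeholder entity IDs. These IDs are invented, not real Wikidata IDs.
TOY_LEXICON = {
    "en": {"plagiarism": "E1", "translation": "E2", "university": "E3"},
    "de": {"plagiat": "E1", "übersetzung": "E2", "universität": "E3"},
}


def entity_vector(text, lang):
    """Count the entity IDs whose surface forms occur in the text."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()
    return Counter(lexicon[token] for token in tokens if token in lexicon)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


english = entity_vector("the university detected plagiarism in the translation", "en")
german = entity_vector("die universität fand ein plagiat", "de")
score = cosine_similarity(english, german)
print(f"similarity: {score:.3f}")
```

Because both documents are reduced to the same entity IDs, the comparison needs no translation step; the two toy sentences share two of three entities and therefore score well above zero.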

@inproceedings{Stegmueller2021,
	address = {Virtual Event},
	title = {Detecting {Cross}-{Language} {Plagiarism} using {Open} {Knowledge} {Graphs}},
	url = {paper=https://www.gipp.com/wp-content/papercite-data/pdf/stegmueller2021.pdf code/data=https://doi.org/10.5281/zenodo.5159398},
	abstract = {Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.},
	booktitle = {2nd {Workshop} on {Extraction} and {Evaluation} of {Knowledge} {Entities} from {Scientific} {Documents} ({EEKE2021}) at the {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} 2021 ({JCDL2021})},
	publisher = {ACM},
	author = {Stegmueller, Johannes and Bauer-Marquart, Fabian and Meuschke, Norman and Ruas, Terry and Schubotz, Moritz and Gipp, Bela},
	month = sep,
	year = {2021},
}
