SnapCode - A Snapshot Based Approach to Code Stylometry

SnapCode - A Snapshot Based Approach to Code Stylometry. Sarnot, S. A. P., Rinke, S., Raimalwalla, R., Joshi, R., Khengare, R., & Goel, P. In 2019 International Conference on Information Technology (ICIT), pages 337–341, December 2019.

Artificial neural networks have seen significant advancements in recent times with the growing popularity of deep learning. Deep learning allows us to learn representations that are otherwise difficult to extract and helps in better classification tasks. Image, video and speech processing are the major areas where deep learning is applied. Our work is related to the application of deep learning to source code. Previous works in this domain have failed to easily capture structural and behavioral aspects of the code, thereby relying on manual feature engineering for applications like author identification, code quality analysis, cyber-attack investigation, malware recognition and plagiarism detection. We propose a novel approach to capture these feature representations by processing snapshots of code instead of processing source code token by token. We, therefore, propose SnapCode, a snapshot-based approach to extract deep convolutional features from text which would otherwise be impossible using currently known approaches. SnapCode uses a deep convolutional neural network coupled with transfer learning to learn the structural representation of the source code. We show that simple networks fail to learn these features and a deep network coupled with transfer learning gives us the best results. SnapCode can capture behavioral aspects of source code as we apply it to the task of author detection, also known as "code stylometry". We choose author detection to validate our approach as it requires the largest number of manual and complicated features. Although source code is simply text, we aim to process text data in a way similar to humans and show that we could learn meaningful representations.

@inproceedings{sarnot_snapcode_2019,
	title = {{SnapCode} - {A} {Snapshot} {Based} {Approach} to {Code} {Stylometry}},
	doi = {10.1109/ICIT48102.2019.00066},
	abstract = {Artificial neural networks have seen significant advancements in recent times with the growing popularity of deep learning. Deep learning allows us to learn representations that are otherwise difficult to extract and helps in better classification tasks. Image, video and speech processing are the major areas where deep learning is applied. Our work is related to the application of deep learning to source code. Previous works in this domain have failed to easily capture structural and behavioral aspects of the code, thereby relying on manual feature engineering for applications like author identification, code quality analysis, cyber-attack investigation, malware recognition and plagiarism detection. We propose a novel approach to capture these feature representations by processing snapshots of code instead of processing source code token by token. We, therefore, propose SnapCode, a snapshot-based approach to extract deep convolutional features from text which would otherwise be impossible using currently known approaches. SnapCode uses a deep convolutional neural network coupled with transfer learning to learn the structural representation of the source code. We show that simple networks fail to learn these features and a deep network coupled with transfer learning gives us the best results. SnapCode can capture behavioral aspects of source code as we apply it to the task of author detection, also known as "code stylometry". We choose author detection to validate our approach as it requires the largest number of manual and complicated features. Although source code is simply text, we aim to process text data in a way similar to humans and show that we could learn meaningful representations.},
	booktitle = {2019 {International} {Conference} on {Information} {Technology} ({ICIT})},
	author = {Sarnot, Saloni Alias Puja and Rinke, Sanjana and Raimalwalla, Rayomand and Joshi, Raviraj and Khengare, Rahul and Goel, Purvi},
	month = dec,
	year = {2019},
	keywords = {\#broken, Code Stylometry, Computer languages, Convolutional Neural Network, Convolutional neural networks, Feature extraction, Image Processing, Jab/\#ICIT, Machine learning, Manuals, Syntactics, Task analysis, Transfer learning},
	pages = {337--341},
}
