n-Graph Information Content Based Approach to Spam User Invalidation and Redirection Filtering. Sethi, R. J. Technical Report, 2004.
Abstract: In this paper, I consider a novel approach to unsolicited bulk email, or spam, filtering based on information analysis of n-Graph frequency distributions in spam vs. non-spam email. This work is dependent on a collection of over 100 spam emails harvested from the wild and classified by a human (namely, me) over the course of three months, from January to March, 2004. This corpus of spam was then integrated with non-spam email for various accuracy benchmarks. An Index of Coincidence (IC) measure is the primary tool, although filtering based on pure frequency distributions of the n-Graphs can also be used. Current techniques used for spam filtering are also evaluated. Most of these center around the MUA whereas this technique is employed in a much narrower application at the MTA level only. The most important advantage of this approach is the achievement of a 0% False Positive rate (using the current corpus). In addition, since the messages flagged utilizing this method are guaranteed to be spam, a further sophistication is added whereby the sender is mailed a spoofed unknown user message in order to invalidate the legitimate email address from the spammers’ databases. Finally, proposals are made for extending this analysis, in addition to incorporating additional innovations and expanding the purview of the application.
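The abstract names an Index of Coincidence over n-Graph frequencies as the primary measure but does not give the report's exact formula, normalization, or decision threshold. As a minimal sketch only, assuming the classical definition IC = sum of f(f-1) over all distinct n-graphs, divided by N(N-1), where f is an n-graph's count and N is the total number of n-graphs, the computation could look like the following. The function names `ngraphs` and `index_of_coincidence` are hypothetical and are not taken from the report.

```python
from collections import Counter


def ngraphs(text, n):
    """Return all overlapping character n-graphs of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_of_coincidence(text, n=1):
    """Classical IC: chance that two n-graphs drawn without replacement match.

    Assumption: this is the textbook definition, not necessarily the
    variant or normalization used in the report.
    """
    grams = ngraphs(text, n)
    total = len(grams)
    if total < 2:
        # Fewer than two n-graphs: no pair can be drawn.
        return 0.0
    counts = Counter(grams)
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))
```

A filter built on this would compare a message's IC against a cutoff tuned on labeled mail; the abstract does not state such a cutoff, so none is assumed here.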
@techreport{ Sethi2004a,
abstract = {In this paper, I consider a novel approach to unsolicited bulk email, or spam, filtering based on information analysis of n-Graph frequency distributions in spam vs. non-spam email. This work is dependent on a collection of over 100 spam emails harvested from the wild and classified by a human (namely, me) over the course of three months, from January to March, 2004. This corpus of spam was then integrated with non-spam email for various accuracy benchmarks. An Index of Coincidence (IC) measure is the primary tool, although filtering based on pure frequency distributions of the n-Graphs can also be used. Current techniques used for spam filtering are also evaluated. Most of these center around the MUA whereas this technique is employed in a much narrower application at the MTA level only. The most important advantage of this approach is the achievement of a 0% False Positive rate (using the current corpus). In addition, since the messages flagged utilizing this method are guaranteed to be spam, a further sophistication is added whereby the sender is mailed a spoofed unknown user message in order to invalidate the legitimate email address from the spammers’ databases. Finally, proposals are made for extending this analysis, in addition to incorporating additional innovations and expanding the purview of the application.},
author = {Sethi, Ricky J},
institution = {UCR},
file = {:C$\backslash$:/Users/rjs/Documents/Mendeley Desktop/Sethi/UCR/Sethi_2004_n-Graph Information Content Based Approach to Spam User Invalidation and Redirection Filtering.pdf:pdf},
title = {{n-Graph Information Content Based Approach to Spam User Invalidation and Redirection Filtering}},
year = {2004}
}