A comparative study of machine learning methods for authorship attribution. Jockers, M. L. & Witten, D. M. Literary and Linguistic Computing, 25(2):215–223, June 2010.
We compare and benchmark the performance of five classification methods, four of which are taken from the machine learning literature, in a classic authorship attribution problem involving the Federalist Papers. Cross-validation results are reported for each method, and each method is further employed in classifying the disputed papers and the few papers that are generally understood to be coauthored. These tests are performed using two separate feature sets: a “raw” feature set containing all words and word bigrams that are common to all of the authors, and a second “pre-processed” feature set derived by reducing the raw feature set to include only words meeting a minimum relative frequency threshold. Each of the methods tested performed well, but nearest shrunken centroids and regularized discriminant analysis had the best overall performances with 0/70 cross-validation errors.
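The nearest shrunken centroids method highlighted in the abstract is available in scikit-learn as NearestCentroid with a shrink_threshold. The sketch below is a minimal, illustrative pipeline in that spirit, not the authors' actual code or data: the toy texts, author labels, shrink_threshold value, and min_df cutoff are all placeholder assumptions, and the study additionally evaluated four other classifiers, including regularized discriminant analysis.

```python
# Minimal sketch of a nearest-shrunken-centroids attribution pipeline,
# assuming scikit-learn. All texts, labels, and parameter values below
# are illustrative placeholders, not the paper's data or tuning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Placeholder stand-ins for the 70 Federalist papers of known authorship.
texts = [
    "the people of the several states shall have the power",
    "the power of the union over the states is not unlimited",
    "to the people of new york the union is essential",
    "the accumulation of all powers in the same hands is tyranny",
    "the federal and state governments are agents of the people",
    "ambition must be made to counteract ambition in the government",
]
authors = ["Hamilton", "Hamilton", "Hamilton", "Madison", "Madison", "Madison"]

def to_relative_freq(X):
    # Per-document relative frequencies; densified because NearestCentroid's
    # shrink_threshold does not accept sparse input.
    X = X.toarray().astype(float)
    return X / X.sum(axis=1, keepdims=True)

pipeline = make_pipeline(
    # Words and word bigrams, echoing the paper's "raw" feature set;
    # min_df is a crude stand-in for its relative-frequency cutoff.
    CountVectorizer(ngram_range=(1, 2), min_df=2),
    FunctionTransformer(to_relative_freq),
    NearestCentroid(shrink_threshold=0.2),  # nearest shrunken centroids
)

pipeline.fit(texts, authors)
# Attribute a "disputed" placeholder text; with the real corpus one would
# also report cross-validation error, as the paper does for each method.
print(pipeline.predict(["the powers of the union over the people"]))
```

With the actual corpus, the same pipeline could be scored with sklearn.model_selection.cross_val_score over the 70 known-author papers, mirroring the cross-validation protocol the abstract describes.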
@article{jockers_comparative_2010,
	title = {A comparative study of machine learning methods for authorship attribution},
	volume = {25},
	url = {http://llc.oxfordjournals.org/content/25/2/215.abstract},
	doi = {10.1093/llc/fqq001},
	abstract = {We compare and benchmark the performance of five classification methods, four of which are taken from the machine learning literature, in a classic authorship attribution problem involving the Federalist Papers. Cross-validation results are reported for each method, and each method is further employed in classifying the disputed papers and the few papers that are generally understood to be coauthored. These tests are performed using two separate feature sets: a “raw” feature set containing all words and word bigrams that are common to all of the authors, and a second “pre-processed” feature set derived by reducing the raw feature set to include only words meeting a minimum relative frequency threshold. Each of the methods tested performed well, but nearest shrunken centroids and regularized discriminant analysis had the best overall performances with 0/70 cross-validation errors.},
	language = {en},
	number = {2},
	urldate = {2011-12-14},
	journal = {Literary and Linguistic Computing},
	author = {Jockers, Matthew L. and Witten, Daniela M.},
	month = jun,
	year = {2010},
	keywords = {AnalyzeStatistically, bigdata, meta\_Theorizing, t\_MachineLearning, t\_Stylometry},
	pages = {215--223},
}
