A comparative study of machine learning methods for authorship attribution. Jockers, M. L. & Witten, D. M. Literary and Linguistic Computing, 25(2):215–223, June 2010.
We compare and benchmark the performance of five classification methods, four of which are taken from the machine learning literature, in a classic authorship attribution problem involving the Federalist Papers. Cross-validation results are reported for each method, and each method is further employed in classifying the disputed papers and the few papers that are generally understood to be coauthored. These tests are performed using two separate feature sets: a “raw” feature set containing all words and word bigrams that are common to all of the authors, and a second “pre-processed” feature set derived by reducing the raw feature set to include only words meeting a minimum relative frequency threshold. Each of the methods tested performed well, but nearest shrunken centroids and regularized discriminant analysis had the best overall performances with 0/70 cross-validation errors.
@article{jockers_comparative_2010,
	title = {A comparative study of machine learning methods for authorship attribution},
	volume = {25},
	url = {http://llc.oxfordjournals.org/content/25/2/215.abstract},
	doi = {10.1093/llc/fqq001},
	abstract = {We compare and benchmark the performance of five classification methods, four of which are taken from the machine learning literature, in a classic authorship attribution problem involving the Federalist Papers. Cross-validation results are reported for each method, and each method is further employed in classifying the disputed papers and the few papers that are generally understood to be coauthored. These tests are performed using two separate feature sets: a “raw” feature set containing all words and word bigrams that are common to all of the authors, and a second “pre-processed” feature set derived by reducing the raw feature set to include only words meeting a minimum relative frequency threshold. Each of the methods tested performed well, but nearest shrunken centroids and regularized discriminant analysis had the best overall performances with 0/70 cross-validation errors.},
	language = {en},
	number = {2},
	urldate = {2011-12-14},
	journal = {Literary and Linguistic Computing},
	author = {Jockers, Matthew L. and Witten, Daniela M.},
	month = jun,
	year = {2010},
	keywords = {AnalyzeStatistically, bigdata, meta\_Theorizing, t\_MachineLearning, t\_Stylometry},
	pages = {215--223},
}