Paragraph Clustering for Intrinsic Plagiarism Detection using a Stylistic Vector-Space Model with Extrinsic Features. Brooke, J. & Hirst, G. In Proceedings, PAN 2012 Lab: Uncovering Plagiarism, Authorship and Social Software Misuse — at the CLEF 2012 Conference and Labs of the Evaluation Forum: Information Access Evaluation meets Multilinguality, Multimodality, and Visual Analytics, Rome, September, 2012.
Our approach to the task of intrinsic plagiarism detection uses a vector-space model which eschews surface features in favor of richer extrinsic features, including those based on latent semantic analysis in a larger external corpus. We posit that the popularity and success of surface n-gram features is mostly due to the topic-biased nature of current artificial evaluations, a problem which unfortunately extends to the present PAN evaluation. One interesting aspect of our approach is our way of dealing with small, imbalanced span sizes; we improved performance considerably in our development evaluation by countering these effects using the expected difference of sums of random variables.
@InProceedings{	  brooke11,
  author	= {Julian Brooke and Graeme Hirst},
  title		= {Paragraph Clustering for Intrinsic Plagiarism Detection
		  using a Stylistic Vector-Space Model with Extrinsic
		  Features},
  booktitle	= {Proceedings, {PAN} 2012 Lab: {U}ncovering Plagiarism,
		  Authorship and Social Software Misuse --- at the {CLEF}
		  2012 Conference and Labs of the Evaluation Forum:
		  Information Access Evaluation meets Multilinguality,
		  Multimodality, and Visual Analytics},
  year		= 2012,
  address	= {Rome},
  month		= {September},
  abstract	= {Our approach to the task of intrinsic plagiarism detection
		  uses a vector-space model which eschews surface features in
		  favor of richer extrinsic features, including those based
		  on latent semantic analysis in a larger external corpus. We
		  posit that the popularity and success of surface n-gram
		  features is mostly due to the topic-biased nature of
		  current artificial evaluations, a problem which
		  unfortunately extends to the present PAN evaluation. One
		  interesting aspect of our approach is our way of dealing
		  with small, imbalanced span sizes; we improved performance
		  considerably in our development evaluation by countering
		  these effects using the expected difference of sums of
		  random variables.},
  download	= {http://ftp.cs.toronto.edu/pub/gh/Brooke+Hirst-PAN-2012.pdf}
		  
}
