Robust, lexicalized native language identification. Brooke, J. & Hirst, G. In Proceedings, 24th International Conference on Computational Linguistics (COLING-2012), Mumbai, December, 2012.
Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because these are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided due to the within-corpus topic confounds, and provide a detailed evaluation of various options, including a simple bias adaptation technique and a number of classifier algorithms. Using a new web corpus as a training set, we reach high classification accuracy for a 7-language task, performance which is robust across two independent test sets. Although we show that even higher accuracy is possible using cross-validation, we present strong evidence calling into question the validity of cross-validation evaluation using the standard dataset.
@InProceedings{	  brooke12,
  author	= {Julian Brooke and Graeme Hirst},
  title		= {Robust, lexicalized native language identification},
  booktitle	= {Proceedings, 24th International Conference on
		  Computational Linguistics (COLING-2012)},
  year		= 2012,
  address	= {Mumbai},
  month		= {December},
  abstract	= {Previous approaches to the task of native language
		  identification (Koppel et al., 2005) have been limited to
		  small, within-corpus evaluations. Because these are
		  restrictive and unreliable, we apply cross-corpus
		  evaluation to the task. We demonstrate the efficacy of
		  lexical features, which had previously been avoided due to
		  the within-corpus topic confounds, and provide a detailed
		  evaluation of various options, including a simple bias
		  adaptation technique and a number of classifier algorithms.
		  Using a new web corpus as a training set, we reach high
		  classification accuracy for a 7-language task, performance
		  which is robust across two independent test sets. Although
		  we show that even higher accuracy is possible using
		  cross-validation, we present strong evidence calling into
		  question the validity of cross-validation evaluation using
		  the standard dataset.},
  download	= {http://ftp.cs.toronto.edu/pub/gh/Brooke+Hirst-COLING-2012.pdf}
}
