Building a Lexicon of Formulaic Language for Language Learners. Brooke, J., Hammond, A., Jacob, D., Tsang, V., Hirst, G., & Shein, F. In Proceedings, 11th Workshop on Multiword Expressions, pages 96–104, Denver, Colorado, June 2015.
Though the multiword lexicon has long been of interest in computational linguistics, most relevant work is targeted at only a small portion of it. Our work is motivated by the needs of learners for more comprehensive resources reflecting formulaic language that goes beyond what is likely to be codified in a dictionary. Working from an initial sequential segmentation approach, we present two enhancements: the use of a new measure to promote the identification of lexicalized sequences, and an expansion to include sequences with gaps. We evaluate using a novel method that allows us to calculate an estimate of recall without a reference lexicon, showing that good performance in the second enhancement depends crucially on the first, and that our lexicon conforms much more with human judgment of formulaic language than alternatives.
@inproceedings{Brookeetal2015MWE,
   author = {Julian Brooke and Adam Hammond and David Jacob and Vivian
                  Tsang and Graeme Hirst and Fraser Shein},
   title = {Building a Lexicon of Formulaic Language for Language Learners},
   address = {Denver, Colorado},
   booktitle = {Proceedings, 11th Workshop on Multiword Expressions},
   pages = {96--104},
   year = {2015},
   month = jun,
   download = {http://ftp.cs.toronto.edu/pub/gh/Brooke-etal-2015-MWE.pdf},
   abstract = {Though the multiword lexicon has long been of interest
                  in computational linguistics, most relevant work is
                  targeted at only a small portion of it. Our work is
                  motivated by the needs of learners for more
                  comprehensive resources reflecting formulaic
                  language that goes beyond what is likely to be
                  codified in a dictionary. Working from an initial
                  sequential segmentation approach, we present two
                  enhancements: the use of a new measure to promote
                  the identification of lexicalized sequences, and an
                  expansion to include sequences with gaps. We
                  evaluate using a novel method that allows us to
                  calculate an estimate of recall without a reference
                  lexicon, showing that good performance in the second
                  enhancement depends crucially on the first, and that
                  our lexicon conforms much more with human judgment
                  of formulaic language than alternatives.}
}
