Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish. Savary, A. & Piskorski, J. Information Systems Journal, 2010.
The paper investigates the accuracy of a Named Entity Recognition (NER) algorithm based on the Hidden Markov Model in the domain of Polish stock exchange reports. The task was limited to recognizing and classifying Named Entities that represent persons and companies. The algorithm was tested on a small Polish domain corpus of stock exchange reports and compared with two baselines, one based on the case of the first letters and one based on a gazetteer. It outperformed both baselines, achieving 64% precision and 93% recall for person names and 78% precision and 83% recall for company names. Adding simple hand-written post-processing rules increased the precision for person names to 87%. A cross-domain evaluation on a small corpus of police reports is also presented. We discuss the problem of method portability in light of the much worse results obtained on the second corpus. A combination of different knowledge sources is sketched as a possible way of overcoming the portability problem.
