Automatic Generation of a Training Set for NER on Portuguese journalistic text. Teixeira, J. paginasfeuppt, 2011.
Tracking the names of public personalities in the news is nowadays impossible without semi-automatic techniques, and it usually requires human intervention for the annotation and validation of corpora. The main goal of this paper is to automatically generate a training set of news for Named Entity Recognition (NER) on journalistic text. This allows us to build the entire pipeline of a NER system with no human intervention. A news corpus, containing 20,000 news items crawled from online newspapers, is automatically annotated with a list, extracted from Voxx, of approximately one thousand names of well-known people who are frequently mentioned in the news. Additionally, we describe examples from this corpus with a set of feature vectors that include syntactic, semantic and structural information about their words. We intend to create a NER system, based on Conditional Random Fields, that is specialized for names of people. We use HAREM (an annotated corpus of Named Entities for Portuguese) as our gold-standard corpus. The annotation of the training set achieves a precision of 95% and a recall of 74%. The NER system achieves a precision of 78% and a recall of 23%.
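The list-based annotation step and its evaluation against a gold standard can be illustrated with a minimal sketch. It is not the paper's implementation: the longest-match lookup, the BIO tag names (`B-PER`, `I-PER`, `O`), the exact-span scoring, and the function names `annotate`, `spans` and `precision_recall` are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple

Name = Tuple[str, ...]


def annotate(tokens: Sequence[str], names: Set[Name]) -> List[str]:
    """Tag tokens as B-PER/I-PER/O by longest-match lookup in a name list.

    Assumption: names are pre-tokenized tuples and matching is exact.
    """
    max_len = max((len(n) for n in names), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so full names win over partial ones.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + span]) in names:
                matched = span
                break
        if matched:
            tags[i] = "B-PER"
            for j in range(i + 1, i + matched):
                tags[j] = "I-PER"
            i += matched
        else:
            i += 1
    return tags


def spans(tags: Sequence[str]) -> Set[Tuple[int, int]]:
    """Collect (start, end) offsets of tagged names, end exclusive."""
    found: Set[Tuple[int, int]] = set()
    start = None
    # A trailing "O" sentinel closes a name that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and tag != "I-PER":
            found.add((start, i))
            start = None
        if tag == "B-PER":
            start = i
    return found


def precision_recall(pred: Sequence[str], gold: Sequence[str]) -> Tuple[float, float]:
    """Exact-span precision and recall of predicted tags against gold tags."""
    p, g = spans(pred), spans(gold)
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return precision, recall
```

With a name list that covers only some of the people mentioned in a sentence, every name it tags is correct but others are missed, which matches the pattern reported for the training set: high precision with lower recall.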
