Automatic Generation of a Training Set for NER on Portuguese journalistic text. Teixeira, J. paginasfeuppt, 2011.
Tracking names of public personalities in the news is nowadays impossible without semi-automatic techniques, and usually requires human intervention for annotation and validation of corpora. The main goal of this paper is to automatically generate a training set of news for Named Entity Recognition on journalistic text. This allows us to build the entire pipeline of a NER system with no human intervention. A news corpus, containing 20,000 news items crawled from online newspapers, is automatically annotated with a list, extracted from Voxx, of approximately one thousand names of well-known people frequently mentioned in the news. Additionally, we describe examples from this corpus with a set of feature vectors that include syntactic, semantic and structural information about its words. We intend to create a NER system, based on Conditional Random Fields, which is specialized for names of people. We use HAREM (an annotated corpus of Named Entities for Portuguese) as our gold-standard corpus; the annotation of the training set achieves a precision of 95% and a recall of 74%. Regarding the NER system, we obtained a precision of 78% and a recall of 23%.
@article{Teixeira2011,
title = {Automatic Generation of a Training Set for NER on Portuguese journalistic text},
type = {article},
year = {2011},
keywords = {Conditional Random Fields,Journalistic text,Named Entity Recognition,Natural Language Processing,Portuguese},
pages = {25-36},
websites = {http://paginas.fe.up.pt/~prodei/dsie11/images/pdfs/s1-3.pdf},
id = {e7aceb77-0af6-34e5-92bc-66323e63846d},
created = {2011-12-28T07:04:55.000Z},
file_attached = {false},
profile_id = {5284e6aa-156c-3ce5-bc0e-b80cf09f3ef6},
group_id = {066b42c8-f712-3fc3-abb2-225c158d2704},
last_modified = {2017-03-14T14:36:19.698Z},
tags = {named entities},
read = {false},
starred = {true},
authored = {false},
confirmed = {true},
hidden = {false},
citation_key = {Teixeira2011},
private_publication = {false},
abstract = {Tracking names of public personalities in the news is nowadays impossible without semi-automatic techniques, and usually requires human intervention for annotation and validation of corpora. The main goal of this paper is to automatically generate a training set of news for Named Entity Recognition on journalistic text. This allows us to build the entire pipeline of a NER system with no human intervention. A news corpus, containing 20,000 news items crawled from online newspapers, is automatically annotated with a list, extracted from Voxx, of approximately one thousand names of well-known people frequently mentioned in the news. Additionally, we describe examples from this corpus with a set of feature vectors that include syntactic, semantic and structural information about its words. We intend to create a NER system, based on Conditional Random Fields, which is specialized for names of people. We use HAREM (an annotated corpus of Named Entities for Portuguese) as our gold-standard corpus; the annotation of the training set achieves a precision of 95% and a recall of 74%. Regarding the NER system, we obtained a precision of 78% and a recall of 23%.},
bibtype = {article},
author = {Teixeira, Jorge},
journal = {paginasfeuppt}
}
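The abstract describes annotating a news corpus automatically from a list of person names and scoring the result by precision and recall against a gold standard. A minimal Python sketch of that idea follows. It is not the author's implementation: the exact-match, longest-span lookup, the BIO tag names, the entity-level scoring, and the sample names and sentence are all assumptions made up for illustration.

```python
from typing import List, Set, Tuple

Name = Tuple[str, ...]

def annotate(tokens: List[str], names: Set[Name]) -> List[str]:
    """Tag tokens B-PER/I-PER/O by longest exact match against a name list."""
    max_len = max((len(n) for n in names), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so full names win over parts.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + size]) in names:
                matched = size
                break
        if matched:
            tags[i] = "B-PER"
            for j in range(i + 1, i + matched):
                tags[j] = "I-PER"
            i += matched
        else:
            i += 1
    return tags

def spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Collect (start, end) entity spans from a BIO tag sequence."""
    found, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag != "I-PER" and start is not None:
            found.add((start, i))
            start = None
        if tag == "B-PER":
            start = i
    return found

def precision_recall(pred: List[str], gold: List[str]) -> Tuple[float, float]:
    """Entity-level precision and recall, counting only exact span matches."""
    p, g = spans(pred), spans(gold)
    tp = len(p & g)
    return (tp / len(p) if p else 0.0, tp / len(g) if g else 0.0)

# Hypothetical data: invented names and sentence, not taken from the paper.
names = {("Ana", "Silva"), ("Pedro", "Santos")}
tokens = ["Ana", "Silva", "reuniu", "com", "Pedro", "Santos", "e", "Carla", "Dias"]
gold = ["B-PER", "I-PER", "O", "O", "B-PER", "I-PER", "O", "B-PER", "I-PER"]

pred = annotate(tokens, names)
precision, recall = precision_recall(pred, gold)
print(pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run every name the list matches is correct, but the person missing from the list is never found, so precision is higher than recall. The figures reported in the abstract show the same direction of imbalance.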