Named Entity Recognition and Classification on Historical Documents: A Survey

Named Entity Recognition and Classification on Historical Documents: A Survey. Ehrmann, M., Hamdi, A., Pontes, E. L., Romanello, M., & Doucet, A. ACM Computing Surveys, 56(2):1–47, September 2021. arXiv:2109.11406 [cs]

After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.

@article{ehrmann_named_2021,
	title = {Named {Entity} {Recognition} and {Classification} on {Historical} {Documents}: {A} {Survey}},
	volume = {56},
	issn = {0360-0300, 1557-7341},
	shorttitle = {Named {Entity} {Recognition} and {Classification} on {Historical} {Documents}},
	url = {http://arxiv.org/abs/2109.11406},
	doi = {10.1145/3604931},
	abstract = {After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.},
	number = {2},
	urldate = {2025-01-01},
	journal = {ACM Computing Surveys},
	author = {Ehrmann, Maud and Hamdi, Ahmed and Pontes, Elvys Linhares and Romanello, Matteo and Doucet, Antoine},
	month = sep,
	year = {2021},
	note = {arXiv:2109.11406 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
	pages = {1--47},
}
