Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early 20th Century Paris Census

Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early 20th Century Paris Census. Constum, T., Kempf, N., Paquet, T., Tranouez, P., Chatelain, C., Brée, S., & Merveille, F. In Uchida, S., Barney, E., & Eglin, V., editors, Document Analysis Systems, pages 143–157, Cham, 2022. Springer International Publishing.

We aim to build a vast database (up to 9 million individuals) from the handwritten tabular nominal census of Paris of 1926, 1931 and 1936, each composed of about 100,000 handwritten simple pages in a tabular format. We created a complete pipeline that goes from the scan of double pages to text prediction while minimizing the need for segmentation labels. We describe how weighted finite state transducers, writer specialization and self-training further improved our results. We also introduce through this communication two annotated datasets for handwriting recognition that are now publicly available, and an open-source toolkit to apply WFST on CTC lattices.

@inproceedings{constumRecognitionInformationExtraction2022,
	address = {Cham},
	title = {Recognition and {Information} {Extraction} in {Historical} {Handwritten} {Tables}: {Toward} {Understanding} {Early} $20^{th}$ {Century} {Paris} {Census}},
	isbn = {978-3-031-06555-2},
	shorttitle = {Recognition and {Information} {Extraction} in {Historical} {Handwritten} {Tables}},
	doi = {10.1007/978-3-031-06555-2_10},
	abstract = {We aim to build a vast database (up to 9 million individuals) from the handwritten tabular nominal census of Paris of 1926, 1931 and 1936, each composed of about 100,000 handwritten simple pages in a tabular format. We created a complete pipeline that goes from the scan of double pages to text prediction while minimizing the need for segmentation labels. We describe how weighted finite state transducers, writer specialization and self-training further improved our results. We also introduce through this communication two annotated datasets for handwriting recognition that are now publicly available, and an open-source toolkit to apply WFST on CTC lattices.},
	language = {en},
	booktitle = {Document {Analysis} {Systems}},
	publisher = {Springer International Publishing},
	author = {Constum, Thomas and Kempf, Nicolas and Paquet, Thierry and Tranouez, Pierrick and Chatelain, Clément and Brée, Sandra and Merveille, François},
	editor = {Uchida, Seiichi and Barney, Elisa and Eglin, Véronique},
	year = {2022},
	keywords = {Document layout analysis, Handwriting recognition, Self-training, Semi-supervised learning, Table analysis, WFST, handwritten text recognition, table recognition},
	pages = {143--157},
}

