Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches

Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches. Camps, J., Vidal-Gorène, C., & Vernet, M. In Barney Smith, E. H. & Pal, U., editors, Document Analysis and Recognition – ICDAR 2021 Workshops, Lecture Notes in Computer Science, pages 306–316, Cham, 2021. Springer International Publishing. Keywords: Handwritten text recognition, Text Recognition, Abbreviation, Abbreviations, Medieval western manuscripts.

Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.

@inproceedings{camps2021,
	address = {Cham},
	series = {Lecture {Notes} in {Computer} {Science}},
	title = {Handling {Heavily} {Abbreviated} {Manuscripts}: {HTR} {Engines} vs {Text} {Normalisation} {Approaches}},
	isbn = {978-3-030-86159-9},
	shorttitle = {Handling {Heavily} {Abbreviated} {Manuscripts}: {HTR} {Engines} vs {Text} {Normalisation} {Approaches}},
	doi = {10.1007/978-3-030-86159-9_21},
	abstract = {Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.},
	language = {en},
	booktitle = {Document {Analysis} and {Recognition} – {ICDAR} 2021 {Workshops}},
	publisher = {Springer International Publishing},
	author = {Camps, Jean-Baptiste and Vidal-Gorène, Chahan and Vernet, Marguerite},
	editor = {Barney Smith, Elisa H. and Pal, Umapada},
	year = {2021},
	keywords = {/unread, Abbreviation, Abbreviations, Handwritten text recognition, Medieval western manuscripts, Text Recognition},
	pages = {306--316},
}
