Between automatic and manual encoding. Pinche, A., Christensen, K., & Gabay, S. In TEI 2022 conference : Text as data, Newcastle, United Kingdom, September, 2022.
Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source’s information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images’ content on a large scale. The TEI seems to provide the perfect format to capture both an image’s formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations. The solution proposed by the Gallic(orpor)a project attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library (BnF).
@inproceedings{pinche2022,
	address = {Newcastle, United Kingdom},
	title = {Between automatic and manual encoding},
	url = {https://hal.science/hal-03780302},
	doi = {10.5281/zenodo.7092214},
	abstract = {Cultural heritage institutions today aim to digitise their collections of prints and
manuscripts (Bermès 2020) and are generating more and more digital images (Gray
2009). To enrich these images, many institutions work with standardised formats such as
IIIF, preserving as much of the source’s information as possible. To take full advantage of
textual documents, an image alone is not enough. Thanks to automatic text recognition
technology, it is now possible to extract images’ content on a large scale. The TEI seems
to provide the perfect format to capture both an image’s formal and textual data (Janès
et al. 2021). However, this poses a problem. To ensure compatibility with a range of
use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be
based on strict data structures that can be automated. But a rigid structure contradicts
the basic principles of philology, which require maximum flexibility to cope with various
situations. The solution proposed by the Gallic(orpor)a project attempted to deal with such a
contradiction, focusing on French historical documents produced between the 15th and
the 18th c. It aims to enrich the digital facsimiles distributed by the French National
Library (BnF).},
	urldate = {2024-01-03},
	booktitle = {{TEI} 2022 conference : {Text} as data},
	author = {Pinche, Ariane and Christensen, Kelly and Gabay, Simon},
	month = sep,
	year = {2022},
	keywords = {HTR, Pipeline, TEI},
}
