Language Resources for Historical Newspapers: the Impresso Collection

Language Resources for Historical Newspapers: the Impresso Collection. European Language Resources Association (ELRA), 2020. 🏷️ /unread


Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. While this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso - Media Monitoring of the Past’ project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

@article{2020c,
	title = {Language {Resources} for {Historical} {Newspapers}: the {Impresso} {Collection}},
	shorttitle = {Language Resources for Historical Newspapers: the {Impresso} Corpus},
	url = {https://www.zora.uzh.ch/id/eprint/191270},
	doi = {10.5167/UZH-191270},
	abstract = {Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. While this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso - Media Monitoring of the Past’ project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.},
	language = {en},
	urldate = {2021-06-08},
	journal = {European Language Resources Association (ELRA)},
	year = {2020},
	note = {🏷️ /unread},
	keywords = {/unread},
}

