Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori. Coto-Solano, R., Li, D., Ferraz, M. T., Sasse, O., Krupka, C., Loáiciga, S., & Nicholas, S. A. T. arXiv.org, Cornell University Library, arXiv.org, Ithaca, United States, December, 2025.
We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
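The abstract attributes the strength of character-level models to decomposing complex characters into their UTF-8 bytes. The sketch below is only an illustration of that idea, not the authors' pipeline. It uses Python's standard `unicodedata` module to strip diacritics (one plausible way to build undiacritized input for a restoration model) and to show the bytes behind a single accented letter. The helper names `strip_diacritics` and `utf8_bytes` are hypothetical, and the sample word is taken from the title.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (NFD).

    Illustrative only: the paper's actual preprocessing is not described here.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of integer byte values."""
    return list(text.encode("utf-8"))


# "\u0101" is the precomposed letter a-with-macron, as in "Māori".
word = "M\u0101ori"

print(strip_diacritics(word))                              # Maori
print(utf8_bytes("\u0101"))                                # [196, 129]
print(utf8_bytes(unicodedata.normalize("NFD", "\u0101")))  # [97, 204, 132]
```

In the decomposed form, the base letter `a` (byte 97) is separate from the combining macron (bytes 204, 132), so a byte-level model sees the diacritic as its own unit rather than as part of an opaque symbol. Marks that are not combining characters, such as a glottal stop written as its own letter, pass through `strip_diacritics` unchanged.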
@article{coto-solano_diacritic_2025,
	address = {Ithaca, United States},
	chapter = {cs:cs:CL},
	title = {Diacritic {Restoration} for {Low}-{Resource} {Indigenous} {Languages}: {Case} {Study} with {Bribri} and {Cook} {Islands} {Māori}},
	copyright = {© 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).  Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.},
	shorttitle = {Diacritic {Restoration} for {Low}-{Resource} {Indigenous} {Languages}},
	url = {https://www.proquest.com/publiccontent/docview/3303186974?pq-origsite=primo&sourcetype=Working%20Papers#},
	abstract = {We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.},
	language = {English},
	urldate = {2026-03-10},
	journal = {arXiv.org},
	publisher = {Cornell University Library, arXiv.org},
	author = {Coto-Solano, Rolando and Li, Daisy and Ferraz, Manoela Teleginski and Sasse, Olivia and Krupka, Cha and Loáiciga, Sharid and Nicholas, Sally Akevai Tenamu},
	month = dec,
	year = {2025},
	keywords = {Computation and Language, Natural language processing, Restoration},
}
