Machine Translation Between High-resource Languages in a Language Documentation Setting

Machine Translation Between High-resource Languages in a Language Documentation Setting. Kann, K., Ebrahimi, A., Stenzel, K., & Palmer, A. In Proceedings of the first workshop on NLP applications to field linguistics, pages 26–33, Gyeongju, Republic of Korea, October 2022. International Conference on Computational Linguistics.

Language documentation encompasses translation, typically into the dominant high-resource language in the region where the target language is spoken. To make data accessible to a broader audience, additional translation into other high-resource languages might be needed. Working within a project documenting Kotiria, we explore the extent to which state-of-the-art machine translation (MT) systems can support this second translation – in our case from Portuguese to English. This translation task is challenging for multiple reasons: (1) the data is out-of-domain with respect to the MT system's training data, (2) much of the data is conversational, (3) existing translations include non-standard and uncommon expressions, often reflecting properties of the documented language, and (4) the data includes borrowings from other regional languages. Despite these challenges, existing MT systems perform at a usable level, though there is still room for improvement. We then conduct a qualitative analysis and suggest ways to improve MT between high-resource languages in a language documentation setting.

@inproceedings{kann_machine_2022,
	address = {Gyeongju, Republic of Korea},
	title = {Machine {Translation} {Between} {High}-resource {Languages} in a {Language} {Documentation} {Setting}},
	url = {https://aclanthology.org/2022.fieldmatters-1.3},
	abstract = {Language documentation encompasses translation, typically into the dominant high-resource language in the region where the target language is spoken. To make data accessible to a broader audience, additional translation into other high-resource languages might be needed. Working within a project documenting Kotiria, we explore the extent to which state-of-the-art machine translation (MT) systems can support this second translation – in our case from Portuguese to English. This translation task is challenging for multiple reasons: (1) the data is out-of-domain with respect to the MT system's training data, (2) much of the data is conversational, (3) existing translations include non-standard and uncommon expressions, often reflecting properties of the documented language, and (4) the data includes borrowings from other regional languages. Despite these challenges, existing MT systems perform at a usable level, though there is still room for improvement. We then conduct a qualitative analysis and suggest ways to improve MT between high-resource languages in a language documentation setting.},
	urldate = {2023-01-09},
	booktitle = {Proceedings of the first workshop on {NLP} applications to field linguistics},
	publisher = {International Conference on Computational Linguistics},
	author = {Kann, Katharina and Ebrahimi, Abteen and Stenzel, Kristine and Palmer, Alexis},
	month = oct,
	year = {2022},
	pages = {26--33},
}
