A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek. Singh, P., Rutten, G., & Lefever, E. In Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 128–137, Punta Cana, Dominican Republic (online), November 2021. Association for Computational Linguistics. DOI: 10.18653/v1/2021.latechclfl-1.15

Abstract: This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT
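The linked repository releases the trained language model. As a minimal sketch of how such a model could be queried for masked-token prediction with Hugging Face Transformers (the model identifier "pranaydeeps/Ancient-Greek-BERT" and its availability on the Hugging Face hub are assumptions; consult the repository README for the exact name and recommended usage):

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Assumed hub identifier, taken from the GitHub repository name.
MODEL_NAME = "pranaydeeps/Ancient-Greek-BERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# Example Ancient Greek sentence with one position masked.
text = f"ἐν ἀρχῇ ἦν ὁ {tokenizer.mask_token}"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the top 5 candidate subwords.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))

The paper's fine-grained Part-of-Speech tagger builds on these embeddings; this sketch only illustrates loading the released language model itself, not the tagging pipeline.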
@inproceedings{singh_pilot_2021,
address = {Punta Cana, Dominican Republic (online)},
title = {A {Pilot} {Study} for {BERT} {Language} {Modelling} and {Morphological} {Analysis} for {Ancient} and {Medieval} {Greek}},
url = {https://aclanthology.org/2021.latechclfl-1.15},
doi = {10.18653/v1/2021.latechclfl-1.15},
abstract = {This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT},
urldate = {2023-10-05},
booktitle = {Proceedings of the 5th {Joint} {SIGHUM} {Workshop} on {Computational} {Linguistics} for {Cultural} {Heritage}, {Social} {Sciences}, {Humanities} and {Literature}},
publisher = {Association for Computational Linguistics},
author = {Singh, Pranaydeep and Rutten, Gorik and Lefever, Els},
month = nov,
year = {2021},
pages = {128--137},
}
{"_id":"YWx2HrPepf7YiccMD","bibbaseid":"singh-rutten-lefever-apilotstudyforbertlanguagemodellingandmorphologicalanalysisforancientandmedievalgreek-2021","author_short":["Singh, P.","Rutten, G.","Lefever, E."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","address":"Punta Cana, Dominican Republic (online)","title":"A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek","url":"https://aclanthology.org/2021.latechclfl-1.15","doi":"10.18653/v1/2021.latechclfl-1.15","abstract":"This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT","urldate":"2023-10-05","booktitle":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","publisher":"Association for Computational Linguistics","author":[{"propositions":[],"lastnames":["Singh"],"firstnames":["Pranaydeep"],"suffixes":[]},{"propositions":[],"lastnames":["Rutten"],"firstnames":["Gorik"],"suffixes":[]},{"propositions":[],"lastnames":["Lefever"],"firstnames":["Els"],"suffixes":[]}],"month":"November","year":"2021","pages":"128–137","bibtex":"@inproceedings{singh_pilot_2021,\n\taddress = {Punta Cana, Dominican Republic (online)},\n\ttitle = {A {Pilot} {Study} for {BERT} {Language} {Modelling} and {Morphological} {Analysis} for {Ancient} and {Medieval} {Greek}},\n\turl = {https://aclanthology.org/2021.latechclfl-1.15},\n\tdoi = {10.18653/v1/2021.latechclfl-1.15},\n\tabstract = {This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. 
The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT},\n\turldate = {2023-10-05},\n\tbooktitle = {Proceedings of the 5th {Joint} {SIGHUM} {Workshop} on {Computational} {Linguistics} for {Cultural} {Heritage}, {Social} {Sciences}, {Humanities} and {Literature}},\n\tpublisher = {Association for Computational Linguistics},\n\tauthor = {Singh, Pranaydeep and Rutten, Gorik and Lefever, Els},\n\tmonth = nov,\n\tyear = {2021},\n\tpages = {128--137},\n}\n\n\n\n","author_short":["Singh, P.","Rutten, G.","Lefever, E."],"key":"singh_pilot_2021","id":"singh_pilot_2021","bibbaseid":"singh-rutten-lefever-apilotstudyforbertlanguagemodellingandmorphologicalanalysisforancientandmedievalgreek-2021","role":"author","urls":{"Paper":"https://aclanthology.org/2021.latechclfl-1.15"},"metadata":{"authorlinks":{}}},"bibtype":"inproceedings","biburl":"https://bibbase.org/zotero-group/schulzkx/5158478","dataSources":["JFDnASMkoQCjjGL8E"],"keywords":[],"search_terms":["pilot","study","bert","language","modelling","morphological","analysis","ancient","medieval","greek","singh","rutten","lefever"],"title":"A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek","year":2021}