LatinCy: Synthetic Trained Pipelines for Latin NLP

Burns, P. J. LatinCy: Synthetic Trained Pipelines for Latin NLP. May 2023. arXiv:2305.04365 [cs], version 1.

This paper introduces LatinCy, a set of trained general-purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework. The models are trained on a large amount of available Latin data, including all five of the Latin Universal Dependency treebanks, which have been preprocessed to be compatible with each other. The result is a set of general models for Latin with good performance on a number of natural language processing tasks (e.g., the top-performing model achieves 97.41% accuracy on POS tagging, 94.66% on lemmatization, and 92.76% on morphological tagging). The paper describes the model training, including its training data and parameterization, and presents the advantages to Latin-language researchers of having a spaCy model available for NLP work.
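As a rough illustration of what using such a pipeline might look like, the sketch below reads the three annotations named in the abstract (lemma, POS tag, morphological features) from a spaCy `Doc`. The model name `la_core_web_lg` and the helper names `load_pipeline`, `annotate`, and `demo` are assumptions made for this example and are not taken from the abstract; the token attributes `lemma_`, `pos_`, and `morph` are standard spaCy.

```python
def load_pipeline(model_name="la_core_web_lg"):
    """Load a spaCy pipeline by name.

    The default model name is an assumption; spaCy and the model
    package must be installed separately for this call to succeed.
    """
    import spacy  # imported lazily so annotate() can be used without spaCy

    return spacy.load(model_name)


def annotate(nlp, text):
    """Return one (text, lemma, POS, morphology) tuple per token."""
    doc = nlp(text)
    return [(tok.text, tok.lemma_, tok.pos_, str(tok.morph)) for tok in doc]


def demo():
    """Print the annotations for a short Latin sentence."""
    nlp = load_pipeline()
    for row in annotate(nlp, "Gallia est omnis divisa in partes tres."):
        print(row)
```

Calling `demo()` prints one tuple per token once the model is installed. `annotate` accepts any callable that returns token-like objects, so it can be exercised with a stub when no model is available.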

@misc{burns_latincy_2023,
	title = {{LatinCy}: {Synthetic} {Trained} {Pipelines} for {Latin} {NLP}},
	shorttitle = {{LatinCy}},
	url = {http://arxiv.org/abs/2305.04365},
	doi = {10.48550/arXiv.2305.04365},
	abstract = {This paper introduces LatinCy, a set of trained general purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework. The models are trained on a large amount of available Latin data, including all five of the Latin Universal Dependency treebanks, which have been preprocessed to be compatible with each other. The result is a set of general models for Latin with good performance on a number of natural language processing tasks (e.g. the top-performing model yields POS tagging, 97.41\% accuracy; lemmatization, 94.66\% accuracy; morphological tagging 92.76\% accuracy). The paper describes the model training, including its training data and parameterization, and presents the advantages to Latin-language researchers of having a spaCy model available for NLP work.},
	urldate = {2023-07-16},
	publisher = {arXiv},
	author = {Burns, Patrick J.},
	month = may,
	year = {2023},
	note = {arXiv:2305.04365 [cs]
version: 1},
	keywords = {Computer Science - Computation and Language},
}
