Ground Truths in the Humanities.
Oortwijn, Y.; Van den Berg, H.; and Betti, A.
In 5th Workshop on Visualization for the Digital Humanities (VIS4DH), Salt Lake City, 2020.
@inproceedings{oortwijn_ground_2020,
address = {Salt Lake City},
title = {Ground {Truths} in the {Humanities}},
url = {https://arxiv.org/pdf/2103.12841.pdf},
abstract = {Ensuring a faithful interaction with data and its representation for humanities
can and should depend on expert-constructed ground truths.},
booktitle = {5th {Workshop} on {Visualization} for the {Digital} {Humanities} ({VIS4DH})},
author = {Oortwijn, Yvette and Van den Berg, Hein and Betti, Arianna},
year = {2020},
}
Quine’s semantic space: Count or predict semantic vectors from small in-domain data? [unpublished draft].
Bloem, J.; Betti, A.; Oortwijn, Y.; Reynaert, M.; and Ossenkoppele, T.
2020.
@inproceedings{bloem_quines_2020,
title = {Quine’s semantic space: {Count} or predict semantic vectors from small in-domain data? [unpublished draft]},
abstract = {We propose extrinsic evaluation as a solution to the problem of evaluating distributional semantic models trained on small, in-domain data sets, where standard datasets of word similarity scores may not capture domain-specific meanings. By relying on a ground truth of text passages relevant to a research question in the history of ideas, we are able to test what model best retrieves those passages in an information retrieval task, quantifying the performance of different types of distributional semantic models in a challenging domain. Our results show that count-based models outperform prediction-based models, as the latter do not generalize well to queries they are not tuned on. The finding that count-based models generate better word embeddings from small data in the 1-2 million token range also holds for document embeddings.},
author = {Bloem, Jelke and Betti, Arianna and Oortwijn, Yvette and Reynaert, Martin and Ossenkoppele, Thijs},
year = {2020},
}
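The setup this abstract describes (embed a query and candidate passages, rank passages by similarity, score the ranking against ground-truth relevant passages) can be sketched in a few lines. The following is a minimal illustration on toy data, not the paper's pipeline: gensim's skip-gram stands in for the prediction-based models, and a small PPMI-plus-SVD construction stands in for the count-based ones.

# Minimal illustration (toy data): compare count-based and prediction-based
# word embeddings on a passage-retrieval task scored against a ground truth.
import numpy as np
from gensim.models import Word2Vec

passages = [
    "epistemology naturalized is the study of knowledge as science".split(),
    "stimulus meaning links observation sentences to sensory evidence".split(),
    "the cat sat quietly on the mat".split(),
]
relevant = {0, 1}      # ids of passages the (toy) ground truth marks relevant
query = "epistemology"

# Count-based model: PPMI co-occurrence matrix reduced with truncated SVD.
vocab = sorted({w for p in passages for w in p})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for p in passages:
    for i, w in enumerate(p):
        for c in p[max(0, i - 2):i + 3]:   # symmetric context window of 2
            if c != w:
                C[idx[w], idx[c]] += 1
pmi = np.log(C * C.sum() / (C.sum(1, keepdims=True) * C.sum(0, keepdims=True) + 1e-12) + 1e-12)
U, S, _ = np.linalg.svd(np.maximum(pmi, 0))
count_vecs = U[:, :10] * S[:10]            # 10-dimensional count-based vectors

# Prediction-based model: skip-gram word2vec trained on the same tiny corpus.
w2v = Word2Vec(passages, vector_size=10, window=2, min_count=1, sg=1, epochs=100)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def precision_at_k(vec_of, k=2):
    # Embed each passage as the mean of its word vectors, rank passages by
    # cosine similarity to the query, and score the top k against the ground truth.
    q = vec_of(query)
    sims = [cosine(q, np.mean([vec_of(w) for w in p], axis=0)) for p in passages]
    top = np.argsort(sims)[::-1][:k]
    return len(set(top) & relevant) / k

print("count-based P@2:", precision_at_k(lambda w: count_vecs[idx[w]]))
print("prediction  P@2:", precision_at_k(lambda w: w2v.wv[w]))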
Expert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains.
Betti, A.; Reynaert, M.; Ossenkoppele, T.; Oortwijn, Y.; Salway, A.; and Bloem, J.
In Proceedings of the 28th International Conference on Computational Linguistics, pages 6690–6702, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
@inproceedings{betti_expert_2020,
address = {Barcelona, Spain (Online)},
title = {Expert {Concept}-{Modeling} {Ground} {Truth} {Construction} for {Word} {Embeddings} {Evaluation} in {Concept}-{Focused} {Domains}},
url = {https://www.aclweb.org/anthology/2020.coling-main.586},
doi = {10.18653/v1/2020.coling-main.586},
abstract = {We present a novel, domain expert-controlled, replicable procedure for the construction of concept-modeling ground truths with the aim of evaluating the application of word embeddings. In particular, our method is designed to evaluate the application of word and paragraph embeddings in concept-focused textual domains, where a generic ontology does not provide enough information. We illustrate the procedure, and validate it by describing the construction of an expert ground truth, QuiNE-GT. QuiNE-GT is built to answer research questions concerning the concept of naturalized epistemology in QUINE, a 2-million-token, single-author, 20th-century English philosophy corpus of outstanding quality, cleaned up and enriched for the purpose. To the best of our ken, expert concept-modeling ground truths are extremely rare in current literature, nor has the theoretical methodology behind their construction ever been explicitly conceptualised and properly systematised. Expert-controlled concept-modeling ground truths are however essential to allow proper evaluation of word embeddings techniques, and increase their trustworthiness in specialised domains in which the detection of concepts through their expression in texts is important. We highlight challenges, requirements, and prospects for future work.},
urldate = {2021-01-26},
booktitle = {Proceedings of the 28th {International} {Conference} on {Computational} {Linguistics}},
publisher = {International Committee on Computational Linguistics},
author = {Betti, Arianna and Reynaert, Martin and Ossenkoppele, Thijs and Oortwijn, Yvette and Salway, Andrew and Bloem, Jelke},
month = dec,
year = {2020},
pages = {6690--6702},
}
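As a rough picture of how such an expert ground truth can be consumed in an evaluation, the sketch below pairs a concept with expert-chosen terms and expert-annotated relevant passages, then scores a model's ranking against it. The field names and identifiers are hypothetical illustrations, not QuiNE-GT's actual schema.

# Hypothetical layout for an expert concept-modeling ground truth entry and a
# simple evaluation against it (illustrative fields, not the paper's schema).
from dataclasses import dataclass, field

@dataclass
class ConceptGroundTruth:
    concept: str                  # e.g. "naturalized epistemology"
    terms: list[str]              # expert-selected expressions of the concept
    relevant_passages: set[str]   # ids of passages experts judged relevant
    annotator_ids: list[str] = field(default_factory=list)  # for replicability

def recall_at_k(gt: ConceptGroundTruth, ranked_ids: list[str], k: int) -> float:
    """Fraction of expert-annotated passages found in a model's top-k ranking."""
    hits = set(ranked_ids[:k]) & gt.relevant_passages
    return len(hits) / len(gt.relevant_passages)

gt = ConceptGroundTruth(
    concept="naturalized epistemology",
    terms=["naturalized epistemology", "epistemology naturalized"],
    relevant_passages={"quine1969:p12", "quine1969:p13", "quine1975:p4"},
    annotator_ids=["expert-1", "expert-2"],
)
# A model's ranking over the corpus would be plugged in here:
print(recall_at_k(gt, ["quine1969:p12", "other:p1", "quine1975:p4"], k=3))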
Corpus Building: WorldCat, Part 2.
Betti, A.
June 2020.
@misc{betti_corpus_2020,
title = {Corpus {Building}: {WorldCat}, {Part} 2},
shorttitle = {Corpus {Building}},
url = {https://quine1960.wordpress.com/2020/06/06/corpus-building-worldcat-part-2/},
abstract = {This is Part 2. Go to Part 1. WorldCat’s record identity and relatedness criteria: WorldCat clusters records of the same edition of the same work, and links records of different editions of the…},
language = {en},
urldate = {2020-08-15},
journal = {quine1960},
author = {Betti, Arianna},
month = jun,
year = {2020},
}
Corpus Building: WorldCat, Part 1.
Betti, A.
May 2020.
@misc{betti_corpus_2020-1,
title = {Corpus {Building}: {WorldCat}, {Part} 1},
shorttitle = {Corpus {Building}},
url = {https://quine1960.wordpress.com/2020/05/28/corpus-building-worldcat-part-1/},
abstract = {Next: Corpus Building: WorldCat, Part 2. Suppose you want to put together a corpus of 16th-century writings, in particular textbooks on physics, in Latin. Here’s one method I will call Pseudo…},
language = {en},
urldate = {2020-08-15},
journal = {quine1960},
author = {Betti, Arianna},
month = may,
year = {2020},
}
Distributional Semantics for Neo-Latin.
Bloem, J.; Parisi, M. C.; Reynaert, M.; Oortwijn, Y.; and Betti, A.
In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 84–93, Marseille, 2020. European Language Resources Association (ELRA).
@inproceedings{bloem_distributional_2020,
address = {Marseille},
title = {Distributional {Semantics} for {Neo}-{Latin}},
url = {https://www.aclweb.org/anthology/2020.lt4hala-1.13},
abstract = {We address the problem of creating and evaluating quality Neo-Latin word embeddings for the purpose of philosophical research, adapting the Nonce2Vec tool to learn embeddings from Neo-Latin sentences. This distributional semantic modeling tool can learn from tiny data incrementally, using a larger background corpus for initialization. We conduct two evaluation tasks: definitional learning of Latin Wikipedia terms, and learning consistent embeddings from 18th century Neo-Latin sentences pertaining to the concept of mathematical method. Our results show that consistent Neo-Latin word embeddings can be learned from this type of data. While our evaluation results are promising, they do not reveal to what extent the learned models match domain expert knowledge of our Neo-Latin texts. Therefore, we propose an additional evaluation method, grounded in expert-annotated data, that would assess whether learned representations are conceptually sound in relation to the domain of study.},
booktitle = {Proceedings of {LT4HALA} 2020 - 1st {Workshop} on {Language} {Technologies} for {Historical} and {Ancient} {Languages}},
publisher = {European Language Resources Association (ELRA)},
author = {Bloem, Jelke and Parisi, Maria Chiara and Reynaert, Martin and Oortwijn, Yvette and Betti, Arianna},
year = {2020},
pages = {84--93},
}
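The incremental, background-initialized learning that Nonce2Vec performs can be approximated with gensim's ordinary continued-training facilities. The sketch below is that plain analogue (it omits Nonce2Vec's aggressive learning-rate scheme), with toy stand-ins for the Neo-Latin sentences.

# Rough analogue of the incremental setup described above: initialize on a
# larger background corpus, then continue training on tiny in-domain data.
# (Plain gensim continued training, not Nonce2Vec itself; toy sentences.)
from gensim.models import Word2Vec

background = [
    "philosophia est scientia rerum".split(),
    "mathematica est scientia quantitatis".split(),
]
in_domain = [
    "methodus mathematica in philosophia adhibetur".split(),
]

model = Word2Vec(background, vector_size=50, window=2, min_count=1, epochs=50)
model.build_vocab(in_domain, update=True)   # add new in-domain words to the vocabulary
model.train(in_domain, total_examples=len(in_domain), epochs=20)

print(model.wv.most_similar("methodus", topn=3))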