Ground Truths in the Humanities.
Oortwijn, Y.; Van den Berg, H.; and Betti, A.
October 2020.
@misc{oortwijn_ground_2020,
address = {Salt Lake City},
type = {Panel talk},
title = {Ground {Truths} in the {Humanities}},
url = {https://vis4dh.dbvis.de/provocations/},
author = {Oortwijn, Yvette and Van den Berg, Hein and Betti, Arianna},
month = oct,
year = {2020},
}
Expert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains.
Betti, A.; Reynaert, M.; Ossenkoppele, T.; Oortwijn, Y.; Salway, A.; and Bloem, J.
In
Proceedings of the 28th International Conference on Computational Linguistics, pages 6690–6702, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics
@inproceedings{betti_expert_2020,
address = {Barcelona, Spain (Online)},
title = {Expert {Concept}-{Modeling} {Ground} {Truth} {Construction} for {Word} {Embeddings} {Evaluation} in {Concept}-{Focused} {Domains}},
url = {https://www.aclweb.org/anthology/2020.coling-main.586},
doi = {10.18653/v1/2020.coling-main.586},
abstract = {We present a novel, domain expert-controlled, replicable procedure for the construction of concept-modeling ground truths with the aim of evaluating the application of word embeddings. In particular, our method is designed to evaluate the application of word and paragraph embeddings in concept-focused textual domains, where a generic ontology does not provide enough information. We illustrate the procedure, and validate it by describing the construction of an expert ground truth, QuiNE-GT. QuiNE-GT is built to answer research questions concerning the concept of naturalized epistemology in QUINE, a 2-million-token, single-author, 20th-century English philosophy corpus of outstanding quality, cleaned up and enriched for the purpose. To the best of our ken, expert concept-modeling ground truths are extremely rare in current literature, nor has the theoretical methodology behind their construction ever been explicitly conceptualised and properly systematised. Expert-controlled concept-modeling ground truths are however essential to allow proper evaluation of word embeddings techniques, and increase their trustworthiness in specialised domains in which the detection of concepts through their expression in texts is important. We highlight challenges, requirements, and prospects for future work.},
urldate = {2021-01-26},
booktitle = {Proceedings of the 28th {International} {Conference} on {Computational} {Linguistics}},
publisher = {International Committee on Computational Linguistics},
author = {Betti, Arianna and Reynaert, Martin and Ossenkoppele, Thijs and Oortwijn, Yvette and Salway, Andrew and Bloem, Jelke},
month = dec,
year = {2020},
pages = {6690--6702},
}
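The entry above centers on evaluating word and paragraph embeddings against an expert ground truth of concept-relevant text passages. A common way to use such a resource is as an information-retrieval benchmark: rank passages by similarity to a query vector and score the top results against the expert annotations. The following is a minimal sketch of that idea, assuming precomputed passage vectors; all function and variable names here are hypothetical, not taken from the paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def precision_at_k(query_vec, passage_vecs, relevant_ids, k):
    """Rank passages by similarity to the query and score the top k
    against the set of passage ids marked relevant in the ground truth."""
    ranked = sorted(passage_vecs,
                    key=lambda pid: cosine(query_vec, passage_vecs[pid]),
                    reverse=True)
    return sum(1 for pid in ranked[:k] if pid in relevant_ids) / k
```

In this framing, an embedding model scores well to the extent that passages the domain experts marked as relevant to the concept appear near the top of its ranking.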
Bolzano, Kant and the Traditional Theory of Concepts - A Computational Investigation [final author version after R&R submitted 12 Sep, 2020].
Ginammi, A.; Bloem, J.; Koopman, R.; Wang, S.; and Betti, A.
In de Block, A.; and Ramsey, G., editor(s),
The Dynamics of Science: Computational Frontiers in History and Philosophy of Science. Pittsburgh University Press, Pittsburgh, 2020.
@incollection{ginammi_bolzano_2020,
address = {Pittsburgh},
title = {Bolzano, {Kant} and the {Traditional} {Theory} of {Concepts} - {A} {Computational} {Investigation} [final author version after {R}\&{R} submitted 12 {Sep}, 2020]},
abstract = {Recent research shows that valuable contributions are obtained by applying even rather simple, well-known computational techniques to texts relevant to the work of researchers in history and philosophy of science (van Wierst et al. 2016). In this paper we substantiate the point by relying on computational text analysis in addressing an open question regarding Bernard Bolzano’s work on the general methodology of the sciences. We investigate to what extent Bolzano followed Kant in seeing concepts as hierarchically ordered by means of definition via compositional analysis by genus proximum and differentia specifica. We show that Bolzano did follow Kant on this traditional doctrine point to a large extent, although Bolzano's conceptual hierarchy is based on subordination rather than composition relations, and that definitions play for Bolzano a merely subjective role. We include a discussion of the computational methodology, and an appendix describing the corpus and the step-by-step workings of the algorithm applied.},
booktitle = {The {Dynamics} of {Science}: {Computational} {Frontiers} in {History} and {Philosophy} of {Science}},
publisher = {Pittsburgh University Press},
author = {Ginammi, Annapaola and Bloem, Jelke and Koopman, Rob and Wang, Shenghui and Betti, Arianna},
editor = {de Block, Andreas and Ramsey, Grant},
year = {2020},
}
Expert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains [accepted for COLING2020].
Betti, A.; Reynaert, M.; Ossenkoppele, T.; Oortwijn, Y.; Salway, A.; and Bloem, J.
In
Proceedings of COLING 2020, 2020.
@inproceedings{betti_expert_2020-1,
title = {Expert {Concept}-{Modeling} {Ground} {Truth} {Construction} for {Word} {Embeddings} {Evaluation} in {Concept}-{Focused} {Domains} [accepted for {COLING2020}]},
abstract = {We present a novel, domain expert-controlled, replicable procedure for the construction of concept-modeling ground truths with the aim of evaluating the application of word embeddings in concept-focused textual domains. We illustrate the procedure by describing the construction of a threefold expert ground truth built to answer research questions concerning the concept of science in the Quine corpus, a 2-million-token, single-author, 20th-century English philosophy corpus of outstanding quality, cleaned up and enriched for the purpose. To the best of our ken, expert concept-modeling ground truths are extremely rare in current literature, nor has the theoretical methodology behind their construction ever been explicitly conceptualised and properly systematised. Expert-controlled concept-modeling ground truths are however essential to allow proper evaluation of word embeddings techniques, and increase their trustworthiness in specialised domains in which the detection of concepts through their expression in texts is important. We highlight challenges, requirements, and prospects for future work.},
booktitle = {Proceedings of {COLING} 2020},
author = {Betti, Arianna and Reynaert, Martin and Ossenkoppele, Thijs and Oortwijn, Yvette and Salway, Andrew and Bloem, Jelke},
year = {2020},
}
Quine’s semantic space: Count or predict semantic vectors from small in-domain data? [draft].
Bloem, J.; Betti, A.; Oortwijn, Y.; Reynaert, M.; and Ossenkoppele, T.
In 2020.
@inproceedings{bloem_quines_2020,
title = {Quine’s semantic space: {Count} or predict semantic vectors from small in-domain data? [draft]},
abstract = {We propose extrinsic evaluation as a solution to the problem of evaluating distributional semantic models trained on small, in-domain data sets, where standard datasets of word similarity scores may not capture domain-specific meanings. By relying on a ground truth of text passages relevant to a research question in the history of ideas, we are able to test what model best retrieves those passages in an information retrieval task, quantifying the performance of different types of distributional semantic models in a challenging domain. Our results show that count-based models outperform prediction-based models, as the latter do not generalize well to queries they are not tuned on. The finding that count-based models generate better word embeddings from small data in the 1-2 million token range also holds for document embeddings.},
author = {Bloem, Jelke and Betti, Arianna and Oortwijn, Yvette and Reynaert, Martin and Ossenkoppele, Thijs},
year = {2020},
}
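The count-based models this entry favors for small corpora are typically built from co-occurrence counts reweighted by positive pointwise mutual information (PPMI). As a rough illustration of what "count-based" means here, the sketch below builds PPMI word vectors from tokenized sentences; it is a minimal toy version, not the paper's actual pipeline, and all names are hypothetical.

```python
from collections import Counter
from math import log

def ppmi_vectors(sentences, window=2):
    """Count-based word vectors: co-occurrence counts within a symmetric
    window, reweighted by positive pointwise mutual information."""
    cooc = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    cooc[(w, sent[j])] += 1
    total = sum(cooc.values())
    row_totals = Counter()   # how often each word occurs as target
    ctx_totals = Counter()   # how often each word occurs as context
    for (w, c), n in cooc.items():
        row_totals[w] += n
        ctx_totals[c] += n
    vocab = sorted(row_totals)
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for (w, c), n in cooc.items():
        pmi = log((n / total) / ((row_totals[w] / total) * (ctx_totals[c] / total)))
        vectors[w][vocab.index(c)] = max(pmi, 0.0)  # clip negatives: positive PMI
    return vocab, vectors
```

Unlike prediction-based models such as word2vec, nothing here is tuned by gradient descent on particular contexts, which is one intuition for why such models can behave more stably on queries outside the training objective.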
Corpus Building: WorldCat, Part 2.
Betti, A.
June 2020.
@misc{betti_corpus_2020,
title = {Corpus {Building}: {WorldCat}, {Part} 2},
shorttitle = {Corpus {Building}},
url = {https://quine1960.wordpress.com/2020/06/06/corpus-building-worldcat-part-2/},
abstract = {This is Part 2. Go to Part 1. WorldCat’s record identity and relatedness criteria: WorldCat clusters records of the same edition of the same work, and links records of different editions of the…},
language = {en},
urldate = {2020-08-15},
journal = {quine1960},
author = {Betti, Arianna},
month = jun,
year = {2020},
}
Corpus Building: WorldCat, Part 1.
Betti, A.
May 2020.
@misc{betti_corpus_2020-1,
title = {Corpus {Building}: {WorldCat}, {Part} 1},
shorttitle = {Corpus {Building}},
url = {https://quine1960.wordpress.com/2020/05/28/corpus-building-worldcat-part-1/},
abstract = {Next: Corpus Building: WorldCat, Part 2. Suppose you want to put together a corpus of 16th century writings, in particular textbooks, on physics, in Latin. Here’s one method I will call Pseudo…},
language = {en},
urldate = {2020-08-15},
journal = {quine1960},
author = {Betti, Arianna},
month = may,
year = {2020},
}
Distributional Semantics for Neo-Latin.
Bloem, J.; Parisi, M. C.; Reynaert, M.; Oortwijn, Y.; and Betti, A.
In
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 84–93, Marseille, 2020. European Language Resources Association (ELRA)
@inproceedings{bloem_distributional_2020,
address = {Marseille},
title = {Distributional {Semantics} for {Neo}-{Latin}},
url = {https://www.aclweb.org/anthology/2020.lt4hala-1.13},
abstract = {We address the problem of creating and evaluating quality Neo-Latin word embeddings for the purpose of philosophical research, adapting the Nonce2Vec tool to learn embeddings from Neo-Latin sentences. This distributional semantic modeling tool can learn from tiny data incrementally, using a larger background corpus for initialization. We conduct two evaluation tasks: definitional learning of Latin Wikipedia terms, and learning consistent embeddings from 18th century Neo-Latin sentences pertaining to the concept of mathematical method. Our results show that consistent Neo-Latin word embeddings can be learned from this type of data. While our evaluation results are promising, they do not reveal to what extent the learned models match domain expert knowledge of our Neo-Latin texts. Therefore, we propose an additional evaluation method, grounded in expert-annotated data, that would assess whether learned representations are conceptually sound in relation to the domain of study.},
booktitle = {Proceedings of {LT4HALA} 2020 - 1st {Workshop} on {Language} {Technologies} for {Historical} and {Ancient} {Languages}},
publisher = {European Language Resources Association (ELRA)},
author = {Bloem, Jelke and Parisi, Maria Chiara and Reynaert, Martin and Oortwijn, Yvette and Betti, Arianna},
year = {2020},
pages = {84--93},
}
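The incremental, tiny-data learning this entry describes typically starts from an additive baseline: a new or rare term is initialized as an average of the background vectors of its context words, which a tool like Nonce2Vec then refines. The sketch below shows only that additive initialization step under stated assumptions (precomputed background vectors, fixed dimensionality); the function name and inputs are hypothetical, not Nonce2Vec's actual API.

```python
def init_from_context(context_words, background_vectors, dim=3):
    """Initialize a vector for an unseen term as the average of the
    background vectors of the context words it appears with; words
    missing from the background model are skipped."""
    vec = [0.0] * dim
    n = 0
    for w in context_words:
        if w in background_vectors:
            for i, x in enumerate(background_vectors[w]):
                vec[i] += x
            n += 1
    return [x / n for x in vec] if n else vec
```

Starting from such a context sum, rather than a random vector, is what lets an incremental model produce a usable representation from a single Neo-Latin sentence.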