Quine’s semantic space: Count or predict semantic vectors from small in-domain data? [unpublished draft]

Quine’s semantic space: Count or predict semantic vectors from small in-domain data? [unpublished draft]. Bloem, J., Betti, A., Oortwijn, Y., Reynaert, M., & Ossenkoppele, T. 2020.

We propose extrinsic evaluation as a solution to the problem of evaluating distributional semantic models trained on small, in-domain data sets, where standard datasets of word similarity scores may not capture domain-specific meanings. By relying on a ground truth of text passages relevant to a research question in the history of ideas, we are able to test what model best retrieves those passages in an information retrieval task, quantifying the performance of different types of distributional semantic models in a challenging domain. Our results show that count-based models outperform prediction-based models as the latter do not generalize well to queries they are not tuned on. The finding that count-based models generate better word embeddings from small data in the 1-2 million token range also holds for document embeddings.
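The retrieval-based evaluation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy passages, the query, the relevance labels, the window size, the function names, and the use of raw co-occurrence counts as a stand-in for a count-based model. It builds word vectors from counts, embeds passages and a query by summing word vectors, ranks passages by cosine similarity, and scores the ranking against a ground truth with average precision.

```python
import math
from collections import Counter

def cooccurrence_vectors(passages, window=2):
    """Count-based word vectors: raw co-occurrence counts in a symmetric window.
    (Hypothetical helper; the window size of 2 is an arbitrary choice.)"""
    vectors = {}
    for tokens in passages:
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def embed(tokens, vectors):
    """Embed a passage or query as the sum of its word vectors."""
    total = Counter()
    for token in tokens:
        total.update(vectors.get(token, {}))
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, passages, vectors):
    """Return passage indices ordered from most to least similar to the query."""
    q = embed(query, vectors)
    scored = [(cosine(q, embed(p, vectors)), i) for i, p in enumerate(passages)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]

def average_precision(ranking, relevant):
    """Average precision of a ranking given the set of relevant passage indices."""
    hits, total = 0, 0.0
    for position, index in enumerate(ranking, start=1):
        if index in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Toy data, invented for this sketch only.
passages = [
    "analytic truths hold by meaning alone".split(),
    "the weather today is cold and wet".split(),
    "meaning and analytic statements are disputed".split(),
]
query = ["analytic", "meaning"]
relevant = {0, 2}  # hypothetical ground-truth labels

vectors = cooccurrence_vectors(passages)
ranking = rank(query, passages, vectors)
score = average_precision(ranking, relevant)
print(ranking, round(score, 3))
```

Under these assumptions, comparing models amounts to swapping out `cooccurrence_vectors` for another vector source and comparing the resulting average precision over a set of queries.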

@inproceedings{bloem_quines_2020,
	title = {Quine’s semantic space: {Count} or predict semantic vectors from small in-domain data? [unpublished draft]},
	abstract = {We propose extrinsic evaluation as a solution to the problem of evaluating distributional semantic models trained on small, in-domain data sets, where standard datasets of word similarity scores may not capture domain-specific meanings. By relying on a ground truth of text passages relevant to a research question in the history of ideas, we are able to test what model best retrieves those passages in an information retrieval task, quantifying the performance of different types of distributional semantic models in a challenging domain. Our results show that count-based models outperform prediction-based models as the latter do not generalize well to queries they are not tuned on. The finding that count-based models generate better word embeddings from small data in the 1-2 million token range also holds for document embeddings.},
	author = {Bloem, Jelke and Betti, Arianna and Oortwijn, Yvette and Reynaert, Martin and Ossenkoppele, Thijs},
	year = {2020},
}

