Evaluating Similarity Measures for Dataset Search

Evaluating Similarity Measures for Dataset Search. Wang, X., Huang, Z., & van Harmelen, F. In Huang, Z., Beek, W., Wang, H., Zhou, R., & Zhang, Y., editors, Web Information Systems Engineering – WISE 2020, Lecture Notes in Computer Science, pages 38–51. Springer International Publishing, Cham, 2020.

Dataset search engines help scientists find research datasets for scientific experiments. Current dataset search engines are query-driven, so their usefulness is limited by how well the search query is specified. An alternative is to adopt a recommendation paradigm (“if you like this dataset, you’ll also like...”). Such a recommendation service requires an appropriate similarity metric between datasets. Various similarity measures have been proposed in computational linguistics and information retrieval. The goal of this paper is to determine which similarity measure is suitable for a dataset search engine. We report our experiments with different similarity measures over datasets and evaluate these measures against the gold standards developed for Elsevier DataSearch, a commercial dataset search engine. Using the F-measure and nDCG evaluation measures, we find that Wu-Palmer Similarity, a similarity measure based on hierarchical terminologies, scores quite well in our benchmarks.
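The two ingredients named in the abstract, a hierarchy-based similarity and a ranking metric, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy taxonomy, the function names, the convention that the root has depth 1, and the linear-gain form of DCG. The paper's own terminologies, gold standards, and parameter choices may differ.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "entity": None,
    "science": "entity",
    "biology": "science",
    "genetics": "biology",
    "ecology": "biology",
    "physics": "science",
}


def path_to_root(term):
    """Return [term, parent, ..., root] for a term in PARENT."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(term))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)),
    where lcs is the deepest ancestor shared by a and b."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common ancestor.
    lcs = next(t for t in path_to_root(b) if t in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def dcg(relevances):
    """Discounted cumulative gain with linear gain (an assumed variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

In this toy hierarchy, sibling terms such as "genetics" and "ecology" share a deep common ancestor and therefore score higher than "genetics" and "physics", which meet only at "science". A ranking whose graded relevance labels are already in descending order gets an nDCG of 1.0.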

@inproceedings{wang_evaluating_2020,
	location = {Cham},
	title = {Evaluating Similarity Measures for Dataset Search},
	isbn = {978-3-030-62008-0},
	doi = {10.1007/978-3-030-62008-0_3},
	series = {Lecture Notes in Computer Science},
	abstract = {Dataset search engines help scientists find research datasets for scientific experiments. Current dataset search engines are query-driven, so their usefulness is limited by how well the search query is specified. An alternative is to adopt a recommendation paradigm (“if you like this dataset, you’ll also like...”). Such a recommendation service requires an appropriate similarity metric between datasets. Various similarity measures have been proposed in computational linguistics and information retrieval. The goal of this paper is to determine which similarity measure is suitable for a dataset search engine. We report our experiments with different similarity measures over datasets and evaluate these measures against the gold standards developed for Elsevier {DataSearch}, a commercial dataset search engine. Using the F-measure and {nDCG} evaluation measures, we find that Wu-Palmer Similarity, a similarity measure based on hierarchical terminologies, scores quite well in our benchmarks.},
	pages = {38--51},
	booktitle = {Web Information Systems Engineering – {WISE} 2020},
	publisher = {Springer International Publishing},
	author = {Wang, Xu and Huang, Zhisheng and van Harmelen, Frank},
	editor = {Huang, Zhisheng and Beek, Wouter and Wang, Hua and Zhou, Rui and Zhang, Yanchun},
	date = {2020},
	langid = {english},
	keywords = {Data science, Dataset search, Google Distance, Ontology-based similarity, Semantic similarity},
}

