@inproceedings{ostendorff_specialized_2022,
	address = {New York, NY, USA},
	series = {{JCDL} '22},
	title = {Specialized {Document} {Embeddings} for {Aspect}-{Based} {Similarity} of {Research} {Papers}},
	isbn = {978-1-4503-9345-4},
	url = {https://doi.org/10.1145/3529372.3530912},
	doi = {10.1145/3529372.3530912},
	abstract = {Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the document coherence, the pairwise classification approach scales poorly to large scale corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t. the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit. This can, for example, be used for more diverse and explainable recommendations.},
	booktitle = {Proceedings of the 22nd {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries}},
	publisher = {Association for Computing Machinery},
	author = {Ostendorff, Malte and Blume, Till and Ruas, Terry and Gipp, Bela and Rehm, Georg},
	year = {2022},
	numpages = {12},
	location = {Cologne, Germany},
	articleno = {7},
	keywords = {aspect-based similarity, content-based recommender systems, document embeddings, document similarity, papers with code},
}
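
The abstract frames aspect-based similarity as ordinary vector similarity computed separately in each aspect-specific embedding space (here: task, method, dataset). A minimal sketch of that idea, with random vectors standing in for the specialized embeddings (e.g. those a Siamese SciBERT model would produce) and all names illustrative, not taken from the paper's code:

```python
import numpy as np

# Hypothetical per-aspect document embeddings; the paper's aspects are
# task, method, and dataset. Random vectors stand in for specialized
# embeddings from a trained model.
rng = np.random.default_rng(0)
n_docs, dim = 5, 8
aspects = ["task", "method", "dataset"]
embeddings = {a: rng.normal(size=(n_docs, dim)) for a in aspects}

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the row vectors of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def aspect_neighbors(doc_id, aspect, k=2):
    """Top-k documents most similar to doc_id in one aspect's space."""
    sims = cosine_sim_matrix(embeddings[aspect])[doc_id]
    order = np.argsort(-sims)           # descending similarity
    return [int(i) for i in order if i != doc_id][:k]

# The same query document can have different neighbors per aspect,
# which is what makes the recommendations aspect-based.
for a in aspects:
    print(a, aspect_neighbors(0, a))
```

Because each aspect needs only a nearest-neighbor search in its own space, the cost grows linearly with corpus size, in contrast to the pairwise classification baseline the abstract mentions.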