Classification and Clustering of arXiv Documents, Sections, and Abstracts Comparing Encodings of Natural and Mathematical Language

Classification and Clustering of arXiv Documents, Sections, and Abstracts Comparing Encodings of Natural and Mathematical Language. Scharpf, P., Schubotz, M., Youssef, A., Hamborg, F., Meuschke, N., & Gipp, B. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Virtual Event, August, 2020. Venue Rating: CORE A*

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labelled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes), and 99.9% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
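The cluster purity figures quoted above can be read against the standard definition of purity: for each cluster, count the items belonging to its most frequent class, sum these counts, and divide by the total number of items. The following is a minimal sketch of that textbook definition, not the authors' implementation; the function name `cluster_purity` and the toy labels are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Return the fraction of items that share the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each class label occurs inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts[cluster][label] += 1

    # Add up the size of the majority class in every cluster.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six items, three clusters, three subject classes.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["math", "math", "cs", "physics", "physics", "cs"]
purity = cluster_purity(clusters, labels)
print(f"purity = {purity:.3f}")
```

Under this definition, purity generally rises as the number of clusters grows (one cluster per item gives a purity of 1.0), so a purity measured with a fixed number of clusters and one measured with an unspecified number of clusters are not directly comparable.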

@inproceedings{ScharpfSYH20,
	address = {Virtual Event},
	title = {Classification and {Clustering} of {arXiv} {Documents}, {Sections}, and {Abstracts} {Comparing} {Encodings} of {Natural} and {Mathematical} {Language}},
	url = {paper=https://www.gipp.com/wp-content/papercite-data/pdf/scharpf2020.pdf code=https://purl.org/class_clust_arxiv_code},
	doi = {10.1145/3383583.3398529},
	abstract = {In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labelled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8\% and cluster purities up to 69.4\% (number of clusters equals number of classes), and 99.9\% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.},
	booktitle = {Proceedings of the {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} ({JCDL})},
	author = {Scharpf, Philipp and Schubotz, Moritz and Youssef, Abdou and Hamborg, Felix and Meuschke, Norman and Gipp, Bela},
	month = aug,
	year = {2020},
	note = {Venue Rating: CORE A*},
	keywords = {Math Information Retrieval},
}
