Scraping Scientific Web Repositories: Challenges and Solutions for Automated Content Extraction

Scraping Scientific Web Repositories: Challenges and Solutions for Automated Content Extraction. Meschenmoser, P., Meuschke, N., Hotz, M., & Gipp, B. In Proceedings of the 5th International Workshop on Mining Scientific Publications (WOSP) held in conjunction with the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Newark, New Jersey, USA, 2016. Venue Rating: CORE A*

Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers' quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful decision-making processes such as hiring or funding decisions. However, scientific Web repositories typically offer only simple performance metrics and limited analysis options. Moreover, the data and algorithms to compute performance metrics are usually not published. Hence, it is not transparent or verifiable which publications the systems include in the computation and how the systems rank the results. Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open-source tool (MIT license) that allows research performance comparisons based on Google Scholar data. We would like to emphasize that the scraper included in the tool should only be used if consent was given by the operator of a repository. In our experience, consent is often given if the research goals are clearly explained and the project is of a non-commercial nature.

@inproceedings{MeschenmoserMHG16,
	address = {Newark, New Jersey, USA},
	title = {Scraping {Scientific} {Web} {Repositories}: {Challenges} and {Solutions} for {Automated} {Content} {Extraction}},
	url = {paper=https://www.gipp.com/wp-content/papercite-data/pdf/meschenmoser2016a.pdf code=https://github.com/ag-gipp/grespa},
	doi = {10.1045/september2016-meschenmoser},
	abstract = {Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers' quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful decision-making processes such as hiring or funding decisions. However, scientific Web repositories typically offer only simple performance metrics and limited analysis options. Moreover, the data and algorithms to compute performance metrics are usually not published. Hence, it is not transparent or verifiable which publications the systems include in the computation and how the systems rank the results. Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open-source tool (MIT license) that allows research performance comparisons based on Google Scholar data. We would like to emphasize that the scraper included in the tool should only be used if consent was given by the operator of a repository. In our experience, consent is often given if the research goals are clearly explained and the project is of a non-commercial nature.},
	booktitle = {Proceedings of the 5th {International} {Workshop} on {Mining} {Scientific} {Publications} ({WOSP}) held in conjunction with the 16th {ACM}/{IEEE}-{CS} {Joint} {Conference} on {Digital} {Libraries} ({JCDL})},
	author = {Meschenmoser, Philipp and Meuschke, Norman and Hotz, Manuel and Gipp, Bela},
	year = {2016},
	note = {Venue Rating: CORE A*},
	keywords = {Miscellaneous},
}
