Analyzing Multi-Task Learning for Abstractive Text Summarization.
Kirstein, F. T.; Wahle, J. P.; Ruas, T.; and Gipp, B.
In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 54–77, Abu Dhabi, United Arab Emirates (Hybrid), 2022. Association for Computational Linguistics.
@inproceedings{kirstein_analyzing_2022,
address = {Abu Dhabi, United Arab Emirates (Hybrid)},
title = {Analyzing {Multi}-{Task} {Learning} for {Abstractive} {Text} {Summarization}},
url = {https://aclanthology.org/2022.gem-1.5},
doi = {10.18653/v1/2022.gem-1.5},
language = {en},
urldate = {2023-08-09},
booktitle = {Proceedings of the 2nd {Workshop} on {Natural} {Language} {Generation}, {Evaluation}, and {Metrics} ({GEM})},
publisher = {Association for Computational Linguistics},
author = {Kirstein, Frederic Thomas and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
year = {2022},
pages = {54--77},
}
A Domain-Adaptive Pre-Training Approach for Language Bias Detection in News.
Krieger, J.; Spinde, T.; Ruas, T.; Kulshrestha, J.; and Gipp, B.
In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries (JCDL '22), New York, NY, USA, June 2022. Association for Computing Machinery. Cologne, Germany, 7 pages, article no. 3.
@inproceedings{krieger_domain-adaptive_2022,
address = {New York, NY, USA},
series = {{JCDL} '22},
title = {A {Domain}-{Adaptive} {Pre}-{Training} {Approach} for {Language} {Bias} {Detection} in {News}},
isbn = {978-1-4503-9345-4},
url = {https://doi.org/10.1145/3529372.3530932},
doi = {10.1145/3529372.3530932},
abstract = {Media bias is a multi-faceted construct influencing individual behavior and collective decision-making. Slanted news reporting is the result of one-sided and polarized writing which can occur in various forms. In this work, we focus on an important form of media bias, i.e. bias by word choice. Detecting biased word choices is a challenging task due to its linguistic complexity and the lack of representative gold-standard corpora. We present DA-RoBERTa, a new state-of-the-art transformer-based model adapted to the media bias domain which identifies sentence-level bias with an F1 score of 0.814. In addition, we also train DA-BERT and DA-BART, two more transformer models adapted to the bias domain. Our proposed domain-adapted models outperform prior bias detection approaches on the same data.},
booktitle = {Proceedings of the 22nd {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries}},
publisher = {Association for Computing Machinery},
author = {Krieger, Jan-David and Spinde, Timo and Ruas, Terry and Kulshrestha, Juhi and Gipp, Bela},
month = jun,
year = {2022},
note = {Number of pages: 7
Place: Cologne, Germany
tex.articleno: 3},
keywords = {domain adaptive, media bias, neural classification, news slant, text analysis},
}
Identifying Machine-Paraphrased Plagiarism.
Wahle, J. P.; Ruas, T.; Foltýnek, T.; Meuschke, N.; and Gipp, B.
In Smits, M., editor, Information for a Better World: Shaping the Global Future, pages 393–413, Cham, 2022. Springer International Publishing.
@inproceedings{wahle_identifying_2022,
address = {Cham},
title = {Identifying {Machine}-{Paraphrased} {Plagiarism}},
isbn = {978-3-030-96957-8},
url = {https://arxiv.org/pdf/2103.11909.pdf},
doi = {10.1007/978-3-030-96957-8_34},
abstract = {Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99\% (F1 = 99.68\% for SpinBot and F1 = 71.64\% for SpinnerChief cases), while human evaluators achieved F1 = 78.4\% for SpinBot and F1 = 65.6\% for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan.},
booktitle = {Information for a {Better} {World}: {Shaping} the {Global} {Future}},
publisher = {Springer International Publishing},
author = {Wahle, Jan Philip and Ruas, Terry and Foltýnek, Tomáš and Meuschke, Norman and Gipp, Bela},
editor = {Smits, Malte},
year = {2022},
pages = {393--413},
}
Specialized Document Embeddings for Aspect-Based Similarity of Research Papers.
Ostendorff, M.; Blume, T.; Ruas, T.; Gipp, B.; and Rehm, G.
In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries (JCDL '22), New York, NY, USA, 2022. Association for Computing Machinery. Cologne, Germany, 12 pages, article no. 7.
@inproceedings{ostendorff_specialized_2022,
address = {New York, NY, USA},
series = {{JCDL} '22},
title = {Specialized {Document} {Embeddings} for {Aspect}-{Based} {Similarity} of {Research} {Papers}},
isbn = {978-1-4503-9345-4},
url = {https://doi.org/10.1145/3529372.3530912},
doi = {10.1145/3529372.3530912},
abstract = {Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the document coherence, the pairwise classification approach scales poorly to large-scale corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t. the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit. This can, for example, be used for more diverse and explainable recommendations.},
booktitle = {Proceedings of the 22nd {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries}},
publisher = {Association for Computing Machinery},
author = {Ostendorff, Malte and Blume, Till and Ruas, Terry and Gipp, Bela and Rehm, Georg},
year = {2022},
note = {Number of pages: 12
Place: Cologne, Germany
tex.articleno: 7},
keywords = {aspect-based similarity, content-based recommender systems, document embeddings, document similarity, papers with code},
}
Exploiting Transformer-Based Multitask Learning for the Detection of Media Bias in News Articles.
Spinde, T.; Krieger, J.; Ruas, T.; Mitrović, J.; Götz-Hahn, F.; Aizawa, A.; and Gipp, B.
In Smits, M., editor, Information for a Better World: Shaping the Global Future, volume 13192, pages 225–235, Cham, 2022. Springer International Publishing.
@inproceedings{spinde_exploiting_2022,
address = {Cham},
title = {Exploiting {Transformer}-{Based} {Multitask} {Learning} for the {Detection} of {Media} {Bias} in {News} {Articles}},
volume = {13192},
isbn = {978-3-030-96956-1 978-3-030-96957-8},
url = {https://link.springer.com/10.1007/978-3-030-96957-8_20},
doi = {10.1007/978-3-030-96957-8_20},
language = {en},
urldate = {2022-03-04},
booktitle = {Information for a {Better} {World}: {Shaping} the {Global} {Future}},
publisher = {Springer International Publishing},
author = {Spinde, Timo and Krieger, Jan-David and Ruas, Terry and Mitrović, Jelena and Götz-Hahn, Franz and Aizawa, Akiko and Gipp, Bela},
editor = {Smits, Malte},
year = {2022},
pages = {225--235},
}
D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research.
Wahle, J. P.; Ruas, T.; Mohammad, S.; and Gipp, B.
In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2642–2651, Marseille, France, June 2022. European Language Resources Association.
@inproceedings{wahle_d3_2022,
address = {Marseille, France},
title = {{D3}: {A} {Massive} {Dataset} of {Scholarly} {Metadata} for {Analyzing} the {State} of {Computer} {Science} {Research}},
url = {https://aclanthology.org/2022.lrec-1.283},
abstract = {DBLP is the largest open-access repository of scientific articles on computer science and provides metadata associated with publications, authors, and venues. We retrieved more than 6 million publications from DBLP and extracted pertinent metadata (e.g., abstracts, author affiliations, citations) from the publication texts to create the DBLP Discovery Dataset (D3). D3 can be used to identify trends in research activity, productivity, focus, bias, accessibility, and impact of computer science research. We present an initial analysis focused on the volume of computer science research (e.g., number of papers, authors, research activity), trends in topics of interest, and citation patterns. Our findings show that computer science is a growing research field (15\% annually), with an active and collaborative researcher community. While papers in recent years present more bibliographical entries in comparison to previous decades, the average number of citations has been declining. Investigating papers' abstracts reveals that recent topic trends are clearly reflected in D3. Finally, we list further applications of D3 and pose supplemental research questions. The D3 dataset, our findings, and source code are publicly available for research purposes.},
booktitle = {Proceedings of the {Thirteenth} {Language} {Resources} and {Evaluation} {Conference}},
publisher = {European Language Resources Association},
author = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif and Gipp, Bela},
month = jun,
year = {2022},
pages = {2642--2651},
}
CS-Insights: A System for Analyzing Computer Science Research.
Ruas, T.; Wahle, J. P.; Küll, L.; Mohammad, S. M.; and Gipp, B.
October 2022.
arXiv:2210.06878 [cs]
@misc{ruas_cs-insights_2022,
title = {{CS}-{Insights}: {A} {System} for {Analyzing} {Computer} {Science} {Research}},
shorttitle = {{CS}-{Insights}},
url = {http://arxiv.org/abs/2210.06878},
abstract = {This paper presents CS-Insights, an interactive web application to analyze computer science publications from DBLP through multiple perspectives. The dedicated interfaces allow its users to identify trends in research activity, productivity, accessibility, author's productivity, venues' statistics, topics of interest, and the impact of computer science research on other fields. CS-Insights is publicly available, and its modular architecture can be easily adapted to domains other than computer science.},
urldate = {2022-10-14},
publisher = {arXiv},
author = {Ruas, Terry and Wahle, Jan Philip and Küll, Lennart and Mohammad, Saif M. and Gipp, Bela},
month = oct,
year = {2022},
note = {arXiv:2210.06878 [cs]},
keywords = {Computer Science - Computation and Language, Computer Science - Digital Libraries},
}