Big Tech-Funded AI Papers Have Higher Citation Impact, Greater Insularity, and Larger Recency Bias.
Gnewuch, M. M.; Wahle, J. P.; Ruas, T.; and Gipp, B.
December 2025. arXiv:2512.05714 [cs].

@misc{gnewuch_big_2025,
  title = {Big {Tech}-{Funded} {AI} {Papers} {Have} {Higher} {Citation} {Impact}, {Greater} {Insularity}, and {Larger} {Recency} {Bias}},
  url = {http://arxiv.org/abs/2512.05714},
  doi = {10.48550/arXiv.2512.05714},
  urldate = {2025-12-08},
  publisher = {arXiv},
  author = {Gnewuch, Max Martin and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  month = dec,
  year = {2025},
  note = {arXiv:2512.05714 [cs]},
  keywords = {!tr\_author, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Digital Libraries, scientometrics},
}

Abstract: Over the past four decades, artificial intelligence (AI) research has flourished at the nexus of academia and industry. However, Big Tech companies have increasingly acquired the edge in computational resources, big data, and talent. So far, it has been largely unclear how many papers the industry funds, how their citation impact compares to non-funded papers, and what drives industry interest. This study fills that gap by quantifying the number of industry-funded papers at 10 top AI conferences (e.g., ICLR, CVPR, AAAI, ACL) and their citation influence. We analyze about 49.8K papers, about 1.8M citations from AI papers to other papers, and about 2.3M citations from other papers to AI papers from 1998-2022 in Scopus. Through seven research questions, we examine the volume and evolution of industry funding in AI research, the citation impact of funded papers, the diversity and temporal range of their citations, and the subfields in which industry predominantly acts. Our findings reveal that industry presence has grown markedly since 2015, from less than 2 percent to more than 11 percent in 2020. Between 2018 and 2022, 12 percent of industry-funded papers achieved high citation rates as measured by the h5-index, compared to 4 percent of non-industry-funded papers and 2 percent of non-funded papers. Top AI conferences engage more with industry-funded research than non-funded research, as measured by our newly proposed metric, the Citation Preference Ratio (CPR). We show that industry-funded research is increasingly insular, citing predominantly other industry-funded papers while referencing fewer non-funded papers. These findings reveal new trends in AI research funding, including a shift towards more industry-funded papers and their growing citation impact, greater insularity of industry-funded work than non-funded work, and a preference of industry-funded research to cite recent work.
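
The abstract does not spell out the formal definition of the Citation Preference Ratio, but a lift-style reading of "preference" suggests the following minimal sketch in Python; the function name and the normalization by pool share are illustrative assumptions, not the paper's exact formula.

def citation_preference_ratio(cites_to_group: int,
                              cites_total: int,
                              group_pool: int,
                              total_pool: int) -> float:
    """CPR > 1: the group is cited more often than its pool share predicts."""
    observed_share = cites_to_group / cites_total
    expected_share = group_pool / total_pool
    return observed_share / expected_share

# Example: 30% of a venue's outgoing citations go to industry-funded papers,
# although such papers make up only 11% of the citable pool -> CPR ~= 2.7.
print(citation_preference_ratio(3_000, 10_000, 1_100, 10_000))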

ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents.
Yang, T.; Ruas, T.; Tian, Y.; Wahle, J. P.; Kurzawe, D.; and Gipp, B.
October 2025. arXiv:2510.25668 [cs].

@misc{yang_alden_2025,
  title = {{ALDEN}: {Reinforcement} {Learning} for {Active} {Navigation} and {Evidence} {Gathering} in {Long} {Documents}},
  shorttitle = {{ALDEN}},
  url = {http://arxiv.org/abs/2510.25668},
  doi = {10.48550/arXiv.2510.25668},
  urldate = {2025-12-01},
  publisher = {arXiv},
  author = {Yang, Tianyu and Ruas, Terry and Tian, Yijun and Wahle, Jan Philip and Kurzawe, Daniel and Gipp, Bela},
  month = oct,
  year = {2025},
  note = {arXiv:2510.25668 [cs]},
  keywords = {!tr\_author, Computer Science - Artificial Intelligence, Computer Science - Multimedia, ir, nlp\_LLM, nlp\_rl},
}

Abstract: Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
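
Of the components the abstract names, the visual-semantic anchoring mechanism is the most self-contained; a minimal PyTorch sketch of a dual-path KL penalty follows. The separate coefficients, the masking scheme, and the comparison against a frozen reference model are assumptions made for illustration; the paper's actual formulation may differ.

import torch
import torch.nn.functional as F

def dual_path_kl(policy_logits: torch.Tensor,
                 ref_logits: torch.Tensor,
                 visual_mask: torch.Tensor,
                 beta_vis: float = 0.1,
                 beta_txt: float = 0.01) -> torch.Tensor:
    """policy_logits, ref_logits: (seq, vocab); visual_mask: (seq,) bool.
    Assumes the sequence contains both visual and textual positions;
    coefficients are illustrative, not the paper's values."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_p = F.softmax(ref_logits, dim=-1)
    kl_per_token = F.kl_div(logp, ref_p, reduction="none").sum(-1)  # (seq,)
    kl_visual = kl_per_token[visual_mask].mean()    # anchor the visual path
    kl_text = kl_per_token[~visual_mask].mean()     # anchor the textual path
    return beta_vis * kl_visual + beta_txt * kl_text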

Overview of the Plagiarism Detection Task at PAN 2025.
Greiner-Petter, A.; Fröbe, M.; Wahle, J. P.; Ruas, T.; Gipp, B.; Aizawa, A.; and Potthast, M.
October 2025. arXiv:2510.06805 [cs].

@misc{greiner-petter_overview_2025,
  title = {Overview of the {Plagiarism} {Detection} {Task} at {PAN} 2025},
  url = {http://arxiv.org/abs/2510.06805},
  doi = {10.48550/arXiv.2510.06805},
  urldate = {2025-12-01},
  publisher = {arXiv},
  author = {Greiner-Petter, André and Fröbe, Maik and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela and Aizawa, Akiko and Potthast, Martin},
  month = oct,
  year = {2025},
  note = {arXiv:2510.06805 [cs]},
  keywords = {!tr\_author, Computer Science - Computation and Language, Computer Science - Information Retrieval, nlp\_plagiarism, workshop},
}

Abstract: The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning them with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack in generalizability.
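
The "naive semantic similarity approaches based on embedding vectors" that dominated submissions can be sketched in a few lines; the specific encoder and threshold below are illustrative assumptions, not any participant's actual system.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def flag_candidates(suspicious: list[str], sources: list[str], thr: float = 0.8):
    """Return (suspicious_idx, source_idx, cosine) pairs above a threshold."""
    a = model.encode(suspicious, normalize_embeddings=True)
    b = model.encode(sources, normalize_embeddings=True)
    sims = a @ b.T  # cosine similarity, since the rows are unit-normalized
    return [(i, j, float(sims[i, j]))
            for i in range(len(suspicious))
            for j in range(len(sources))
            if sims[i, j] >= thr]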

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent.
Meier, D.; Wahle, J. P.; Röttger, P.; Ruas, T.; and Gipp, B.
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27232–27249, Suzhou, China, 2025. Association for Computational Linguistics.

@inproceedings{meier_trojanstego_2025,
  address = {Suzhou, China},
  title = {{TrojanStego}: {Your} {Language} {Model} {Can} {Secretly} {Be} {A} {Steganographic} {Privacy} {Leaking} {Agent}},
  shorttitle = {{TrojanStego}},
  url = {https://aclanthology.org/2025.emnlp-main.1386},
  doi = {10.18653/v1/2025.emnlp-main.1386},
  language = {en},
  urldate = {2025-11-11},
  booktitle = {Proceedings of the 2025 {Conference} on {Empirical} {Methods} in {Natural} {Language} {Processing}},
  publisher = {Association for Computational Linguistics},
  author = {Meier, Dominik and Wahle, Jan Philip and Röttger, Paul and Ruas, Terry and Gipp, Bela},
  year = {2025},
  pages = {27232--27249},
}

Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions.
Kirstein, F.; Kumar, S.; Ruas, T.; and Gipp, B.
In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20087–20137, Suzhou, China, 2025. Association for Computational Linguistics.

@inproceedings{kirstein_re-frame_2025,
  address = {Suzhou, China},
  title = {Re-{FRAME} the {Meeting} {Summarization} {SCOPE}: {Fact}-{Based} {Summarization} and {Personalization} via {Questions}},
  shorttitle = {Re-{FRAME} the {Meeting} {Summarization} {SCOPE}},
  url = {https://aclanthology.org/2025.findings-emnlp.1094},
  doi = {10.18653/v1/2025.findings-emnlp.1094},
  language = {en},
  urldate = {2025-11-11},
  booktitle = {Findings of the {Association} for {Computational} {Linguistics}: {EMNLP} 2025},
  publisher = {Association for Computational Linguistics},
  author = {Kirstein, Frederic and Kumar, Sonu and Ruas, Terry and Gipp, Bela},
  year = {2025},
  pages = {20087--20137},
}

SPaRC: A Spatial Pathfinding Reasoning Challenge.
Kaesberg, L. B.; Wahle, J. P.; Ruas, T.; and Gipp, B.
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10370–10401, Suzhou, China, 2025. Association for Computational Linguistics.

@inproceedings{kaesberg_sparc_2025,
  address = {Suzhou, China},
  title = {{SPaRC}: {A} {Spatial} {Pathfinding} {Reasoning} {Challenge}},
  shorttitle = {{SPaRC}},
  url = {https://aclanthology.org/2025.emnlp-main.526},
  doi = {10.18653/v1/2025.emnlp-main.526},
  language = {en},
  urldate = {2025-11-11},
  booktitle = {Proceedings of the 2025 {Conference} on {Empirical} {Methods} in {Natural} {Language} {Processing}},
  publisher = {Association for Computational Linguistics},
  author = {Kaesberg, Lars Benedikt and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  year = {2025},
  pages = {10370--10401},
}

MALLM: Multi-Agent Large Language Models Framework.
Becker, J.; Kaesberg, L. B.; Bauer, N.; Wahle, J. P.; Ruas, T.; and Gipp, B.
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 418–439, Suzhou, China, 2025. Association for Computational Linguistics.

@inproceedings{becker_mallm_2025,
  address = {Suzhou, China},
  title = {{MALLM}: {Multi}-{Agent} {Large} {Language} {Models} {Framework}},
  shorttitle = {{MALLM}},
  url = {https://aclanthology.org/2025.emnlp-demos.29},
  doi = {10.18653/v1/2025.emnlp-demos.29},
  language = {en},
  urldate = {2025-11-11},
  booktitle = {Proceedings of the 2025 {Conference} on {Empirical} {Methods} in {Natural} {Language} {Processing}: {System} {Demonstrations}},
  publisher = {Association for Computational Linguistics},
  author = {Becker, Jonas and Kaesberg, Lars Benedikt and Bauer, Niklas and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  year = {2025},
  pages = {418--439},
}

You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with Multi-Agent Conversations.
Kirstein, F.; Khan, M.; Wahle, J. P.; Ruas, T.; and Gipp, B.
In Findings of the Association for Computational Linguistics: ACL 2025, pages 11482–11525, Vienna, Austria, 2025. Association for Computational Linguistics.

@inproceedings{kirstein_you_2025,
  address = {Vienna, Austria},
  title = {You need to {MIMIC} to get {FAME}: {Solving} {Meeting} {Transcript} {Scarcity} with {Multi}-{Agent} {Conversations}},
  shorttitle = {You need to {MIMIC} to get {FAME}},
  url = {https://aclanthology.org/2025.findings-acl.599},
  doi = {10.18653/v1/2025.findings-acl.599},
  language = {en},
  urldate = {2025-09-01},
  booktitle = {Findings of the {Association} for {Computational} {Linguistics}: {ACL} 2025},
  publisher = {Association for Computational Linguistics},
  author = {Kirstein, Frederic and Khan, Muneeb and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  year = {2025},
  pages = {11482--11525},
}

Voting or Consensus? Decision-Making in Multi-Agent Debate.
Kaesberg, L. B.; Becker, J.; Wahle, J. P.; Ruas, T.; and Gipp, B.
In Findings of the Association for Computational Linguistics: ACL 2025, pages 11640–11671, Vienna, Austria, 2025. Association for Computational Linguistics.

@inproceedings{kaesberg_voting_2025,
  address = {Vienna, Austria},
  title = {Voting or {Consensus}? {Decision}-{Making} in {Multi}-{Agent} {Debate}},
  shorttitle = {Voting or {Consensus}?},
  url = {https://aclanthology.org/2025.findings-acl.606},
  doi = {10.18653/v1/2025.findings-acl.606},
  language = {en},
  urldate = {2025-09-01},
  booktitle = {Findings of the {Association} for {Computational} {Linguistics}: {ACL} 2025},
  publisher = {Association for Computational Linguistics},
  author = {Kaesberg, Lars Benedikt and Becker, Jonas and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  year = {2025},
  pages = {11640--11671},
}

BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages.
Muhammad, S. H.; Ousidhoum, N.; Abdulmumin, I.; Wahle, J. P.; Ruas, T.; Beloucif, M.; De Kock, C.; Surange, N.; Teodorescu, D.; Ahmad, I. S.; Adelani, D. I.; Aji, A. F.; Ali, F. D. M. A.; Alimova, I.; Araujo, V.; Babakov, N.; Baes, N.; Bucur, A.; Bukula, A.; Cao, G.; Tufiño, R.; Chevi, R.; Chukwuneke, C. I.; Ciobotaru, A.; Dementieva, D.; Gadanya, M. S.; Geislinger, R.; Gipp, B.; Hourrane, O.; Ignat, O.; Lawan, F. I.; Mabuya, R.; Mahendra, R.; Marivate, V.; Panchenko, A.; Piper, A.; Ferreira, C. H. P.; Protasov, V.; Rutunda, S.; Shrivastava, M.; Udrea, A. C.; Wanzare, L. D. A.; Wu, S.; Wunderlich, F. V.; Zhafran, H. M.; Zhang, T.; Zhou, Y.; and Mohammad, S. M.
In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8895–8916, Vienna, Austria, 2025. Association for Computational Linguistics.

@inproceedings{muhammad_brighter_2025,
  address = {Vienna, Austria},
  title = {{BRIGHTER}: {BRIdging} the {Gap} in {Human}-{Annotated} {Textual} {Emotion} {Recognition} {Datasets} for 28 {Languages}},
  shorttitle = {{BRIGHTER}},
  url = {https://aclanthology.org/2025.acl-long.436},
  doi = {10.18653/v1/2025.acl-long.436},
  language = {en},
  urldate = {2025-09-01},
  booktitle = {Proceedings of the 63rd {Annual} {Meeting} of the {Association} for {Computational} {Linguistics} ({Volume} 1: {Long} {Papers})},
  publisher = {Association for Computational Linguistics},
  author = {Muhammad, Shamsuddeen Hassan and Ousidhoum, Nedjma and Abdulmumin, Idris and Wahle, Jan Philip and Ruas, Terry and Beloucif, Meriem and De Kock, Christine and Surange, Nirmal and Teodorescu, Daniela and Ahmad, Ibrahim Said and Adelani, David Ifeoluwa and Aji, Alham Fikri and Ali, Felermino D. M. A. and Alimova, Ilseyar and Araujo, Vladimir and Babakov, Nikolay and Baes, Naomi and Bucur, Ana-Maria and Bukula, Andiswa and Cao, Guanqun and Tufiño, Rodrigo and Chevi, Rendi and Chukwuneke, Chiamaka Ijeoma and Ciobotaru, Alexandra and Dementieva, Daryna and Gadanya, Murja Sani and Geislinger, Robert and Gipp, Bela and Hourrane, Oumaima and Ignat, Oana and Lawan, Falalu Ibrahim and Mabuya, Rooweither and Mahendra, Rahmad and Marivate, Vukosi and Panchenko, Alexander and Piper, Andrew and Ferreira, Charles Henrique Porto and Protasov, Vitaly and Rutunda, Samuel and Shrivastava, Manish and Udrea, Aura Cristina and Wanzare, Lilian Diana Awuor and Wu, Sophie and Wunderlich, Florian Valentin and Zhafran, Hanif Muhammad and Zhang, Tianhui and Zhou, Yi and Mohammad, Saif M.},
  year = {2025},
  pages = {8895--8916},
}

Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language.
Zhukova, A.; Matt, C. E.; Ruas, T.; and Gipp, B.
April 2025. arXiv:2504.19856 [cs].

@misc{zhukova_efficient_2025,
  title = {Efficient {Domain}-adaptive {Continual} {Pretraining} for the {Process} {Industry} in the {German} {Language}},
  url = {http://arxiv.org/abs/2504.19856},
  doi = {10.48550/arXiv.2504.19856},
  urldate = {2025-06-25},
  publisher = {arXiv},
  author = {Zhukova, Anastasia and Matt, Christian E. and Ruas, Terry and Gipp, Bela},
  month = apr,
  year = {2025},
  note = {arXiv:2504.19856 [cs]},
  keywords = {!tr\_author, Computer Science - Computation and Language, nlp\_domain\_adaptation, nlp\_plant\_assist},
}

Abstract: Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., language masking. Although popular, it requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that this approach performs better than traditional DAPT by 3.5 points of the average IR metrics (e.g., mAP, MRR, and nDCG) and requires almost 4 times less computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.
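
The retrieval step at the core of ICL-APT, augmenting scarce target data with its nearest domain-related neighbors, can be sketched with plain cosine kNN; precomputed, L2-normalized embeddings and the value of k are assumptions made for illustration.

import numpy as np

def knn_augment(target_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """target_emb: (n, d), corpus_emb: (m, d), both row-normalized.
    Returns the indices of the k most similar corpus texts per target doc."""
    sims = target_emb @ corpus_emb.T          # cosine similarity matrix (n, m)
    return np.argsort(-sims, axis=1)[:, :k]   # top-k neighbors per target doc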

SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection.
Muhammad, S. H.; Ousidhoum, N.; Abdulmumin, I.; Yimam, S. M.; Wahle, J. P.; Ruas, T.; Beloucif, M.; Kock, C. D.; Belay, T. D.; Ahmad, I. S.; Surange, N.; Teodorescu, D.; Adelani, D. I.; Aji, A. F.; Ali, F.; Araujo, V.; Ayele, A. A.; Ignat, O.; Panchenko, A.; Zhou, Y.; and Mohammad, S. M.
March 2025. arXiv:2503.07269 [cs].

@misc{muhammad_semeval-2025_2025,
  title = {{SemEval}-2025 {Task} 11: {Bridging} the {Gap} in {Text}-{Based} {Emotion} {Detection}},
  shorttitle = {{SemEval}-2025 {Task} 11},
  url = {http://arxiv.org/abs/2503.07269},
  doi = {10.48550/arXiv.2503.07269},
  urldate = {2025-03-11},
  publisher = {arXiv},
  author = {Muhammad, Shamsuddeen Hassan and Ousidhoum, Nedjma and Abdulmumin, Idris and Yimam, Seid Muhie and Wahle, Jan Philip and Ruas, Terry and Beloucif, Meriem and Kock, Christine De and Belay, Tadesse Destaw and Ahmad, Ibrahim Said and Surange, Nirmal and Teodorescu, Daniela and Adelani, David Ifeoluwa and Aji, Alham Fikri and Ali, Felermino and Araujo, Vladimir and Ayele, Abinew Ali and Ignat, Oana and Panchenko, Alexander and Zhou, Yi and Mohammad, Saif M.},
  month = mar,
  year = {2025},
  note = {arXiv:2503.07269 [cs]},
  keywords = {!tr\_author, Computer Science - Computation and Language, semeval},
}

Abstract: We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and spoken across various continents. The data instances are multi-labeled into six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) emotion labels in monolingual settings, (b) emotion intensity scores, and (c) emotion labels in cross-lingual settings. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, as well as findings on the best-performing systems, the most common approaches, and the most effective methods across various tracks and languages. The datasets for this task are publicly available.
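
The multi-label setup described above maps each text to a subset of six emotion classes; a minimal scikit-learn sketch follows. The six class names are an assumption (the common Ekman set); the task defines the actual inventory per language.

from sklearn.preprocessing import MultiLabelBinarizer

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]  # assumed
mlb = MultiLabelBinarizer(classes=EMOTIONS)
# One row per text, one 0/1 column per emotion; texts with no emotion are allowed.
y = mlb.fit_transform([["joy", "surprise"], [], ["fear", "sadness"]])
print(y)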

Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator.
Kirstein, F. T.; Lima Ruas, T.; and Gipp, B.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; Schockaert, S.; Darwish, K.; and Agarwal, A., editors, Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 561–574, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.

@inproceedings{kirstein-etal-2025-meeting,
  address = {Abu Dhabi, UAE},
  title = {Is my {Meeting} {Summary} {Good}? {Estimating} {Quality} with a {Multi}-{LLM} {Evaluator}},
  url = {https://aclanthology.org/2025.coling-industry.48/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}: {Industry} {Track}},
  publisher = {Association for Computational Linguistics},
  author = {Kirstein, Frederic Thomas and Lima Ruas, Terry and Gipp, Bela},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven and Darwish, Kareem and Agarwal, Apoorv},
  month = jan,
  year = {2025},
  pages = {561--574},
}

Abstract: The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.
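
Point-biserial correlation, the headline statistic above, relates a binary variable (an error was flagged) to a continuous one (a human quality score), and SciPy computes it directly; the data below is fabricated purely to show the call.

from scipy.stats import pointbiserialr

flagged = [1, 0, 1, 1, 0, 0, 1, 0]                      # evaluator detects an error
human_score = [2.0, 4.5, 1.5, 2.5, 4.0, 5.0, 3.0, 4.5]  # human quality ratings
r, p = pointbiserialr(flagged, human_score)
print(f"r = {r:.2f}, p = {p:.3f}")  # negative r: flagged summaries rate lower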

Towards Human Understanding of Paraphrase Types in Large Language Models.
Meier, D.; Wahle, J. P.; Lima Ruas, T.; and Gipp, B.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; and Schockaert, S., editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 6298–6316, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.

@inproceedings{meier-etal-2025-towards,
  address = {Abu Dhabi, UAE},
  title = {Towards {Human} {Understanding} of {Paraphrase} {Types} in {Large} {Language} {Models}},
  url = {https://aclanthology.org/2025.coling-main.421/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Meier, Dominik and Wahle, Jan Philip and Lima Ruas, Terry and Gipp, Bela},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
  month = jan,
  year = {2025},
  pages = {6298--6316},
}

Abstract: Paraphrases represent a human's intuitive ability to understand expressions presented in various different ways. Current paraphrase evaluations of language models primarily use binary approaches, offering limited interpretability of specific text changes. Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes and offer a granular view of the flexibility in linguistic expression (e.g., a shift in syntax or vocabulary used). In this study, we assess the human preferences towards ChatGPT in generating English paraphrases with ten APTs and five prompting techniques. We introduce APTY (Atomic Paraphrase TYpes), a dataset of 800 sentence-level and word-level annotations by 15 annotators. The dataset also provides a human preference ranking of paraphrases with different types that can be used to fine-tune models with RLHF and DPO methods. Our results reveal that ChatGPT and a DPO-trained LLama 7B model can generate simple APTs, such as additions and deletions, but struggle with complex structures (e.g., subordination changes). This study contributes to understanding which aspects of paraphrasing language models have already succeeded at understanding and what remains elusive. In addition, we show how our curated datasets can be used to develop language models with specific linguistic capabilities.

What's Wrong? Refining Meeting Summaries with LLM Feedback.
Kirstein, F. T.; Lima Ruas, T.; and Gipp, B.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; and Schockaert, S., editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 2100–2120, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.

@inproceedings{kirstein-etal-2025-whats,
  address = {Abu Dhabi, UAE},
  title = {What's {Wrong}? {Refining} {Meeting} {Summaries} with {LLM} {Feedback}},
  url = {https://aclanthology.org/2025.coling-main.143/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Kirstein, Frederic Thomas and Lima Ruas, Terry and Gipp, Bela},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
  month = jan,
  year = {2025},
  pages = {2100--2120},
}

Abstract: Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.
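
The two-phase identify-then-refine loop can be summarized as below; `identify_errors` and `refine` are hypothetical placeholders for LLM-backed calls, and the error-type names only subset the nine annotated types.

ERROR_TYPES = ["omission", "irrelevance", "structure"]  # subset, for illustration

def correct_summary(transcript: str, summary: str,
                    identify_errors, refine, max_rounds: int = 2) -> str:
    """identify_errors and refine stand in for LLM calls (placeholders)."""
    for _ in range(max_rounds):
        errors = identify_errors(transcript, summary, ERROR_TYPES)  # phase 1
        if not errors:
            break
        feedback = "; ".join(f"{e['type']}: {e['note']}" for e in errors)
        summary = refine(transcript, summary, feedback)             # phase 2
    return summary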

Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields.
Wahle, J. P.; Lima Ruas, T.; Abdalla, M.; Gipp, B.; and Mohammad, S. M.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; and Schockaert, S., editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 1027–1044, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.

@inproceedings{wahle-etal-2025-citation,
  address = {Abu Dhabi, UAE},
  title = {Citation {Amnesia}: {On} {The} {Recency} {Bias} of {NLP} and {Other} {Academic} {Fields}},
  url = {https://aclanthology.org/2025.coling-main.69/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Wahle, Jan Philip and Lima Ruas, Terry and Abdalla, Mohamed and Gipp, Bela and Mohammad, Saif M.},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
  month = jan,
  year = {2025},
  pages = {1027--1044},
}

Abstract: This study examines the tendency to cite older work across 20 fields of study over 43 years (1980–2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to them over time or whether differences can be observed. Our analysis, based on a dataset of ~240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science). The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences), even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community's engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
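
The quantity tracked throughout the study, how far back a paper's citations reach, reduces to a mean age in years; a minimal sketch (the function name is ours, not the paper's):

def mean_citation_age(citing_year: int, cited_years: list[int]) -> float:
    """Average number of years between a paper and the works it cites."""
    return sum(citing_year - y for y in cited_years) / len(cited_years)

# A 2023 paper citing work from 2019, 2021, and 2022 has a mean age of 2.33.
print(round(mean_citation_age(2023, [2019, 2021, 2022]), 2))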

Stay Focused: Problem Drift in Multi-Agent Debate.
Becker, J.; Kaesberg, L. B.; Stephan, A.; Wahle, J. P.; Ruas, T.; and Gipp, B.
February 2025. arXiv:2502.19559 [cs].

@misc{becker_stay_2025,
  title = {Stay {Focused}: {Problem} {Drift} in {Multi}-{Agent} {Debate}},
  shorttitle = {Stay {Focused}},
  url = {http://arxiv.org/abs/2502.19559},
  doi = {10.48550/arXiv.2502.19559},
  urldate = {2025-02-28},
  publisher = {arXiv},
  author = {Becker, Jonas and Kaesberg, Lars Benedikt and Stephan, Andreas and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  month = feb,
  year = {2025},
  note = {arXiv:2502.19559 [cs]},
  keywords = {!tr, Computer Science - Computation and Language, nlp\_llm, nlp\_multiagent},
}

Abstract: Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations, particularly when scaling them to longer reasoning chains. In this study, we unveil a new issue of multi-agent debate: discussions drift away from the initial problem over multiple turns. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, we perform a human study with eight experts on discussions suffering from problem drift, who find the most common issues are a lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To systematically address the issue of problem drift, we propose DRIFTJudge, a method based on LLM-as-a-judge, to detect problem drift at test-time. We further propose DRIFTPolicy, a method to mitigate 31% of problem drift cases. Our study can be seen as a first step to understanding a key limitation of multi-agent debate, highlighting pathways for improving their effectiveness in the future.
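
A test-time detector in the spirit of DRIFTJudge can be reduced to scoring each turn against the original problem; the `judge` callable below is a hypothetical stand-in for an LLM-as-a-judge call, and the threshold is illustrative.

def detect_drift(problem: str, turns: list[str], judge, threshold: float = 0.5):
    """Return the indices of debate turns the judge scores as off-problem."""
    return [i for i, turn in enumerate(turns)
            if judge(problem, turn) < threshold]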

CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization.
Kirstein, F.; Wahle, J. P.; Gipp, B.; and Ruas, T.
Journal of Artificial Intelligence Research, 82: 313–365, January 2025.

@article{kirstein_cads_2025,
  title = {{CADS}: {A} {Systematic} {Literature} {Review} on the {Challenges} of {Abstractive} {Dialogue} {Summarization}},
  volume = {82},
  issn = {1076-9757},
  shorttitle = {{CADS}},
  url = {http://jair.org/index.php/jair/article/view/16674},
  doi = {10.1613/jair.1.16674},
  urldate = {2025-01-29},
  journal = {Journal of Artificial Intelligence Research},
  author = {Kirstein, Frederic and Wahle, Jan Philip and Gipp, Bela and Ruas, Terry},
  month = jan,
  year = {2025},
  pages = {313--365},
}

Abstract: Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although focused reviews have been conducted on this topic, there is a lack of comprehensive work that details the core challenges of dialogue summarization, unifies the differing understanding of the task, and aligns proposed techniques, datasets, and evaluation metrics with the challenges. This article summarizes the research on Transformer-based abstractive summarization for English dialogues by systematically reviewing 1262 unique research papers published between 2019 and 2024, relying on the Semantic Scholar and DBLP databases. We cover the main challenges present in dialog summarization (i.e., language, structure, comprehension, speaker, salience, and factuality) and link them to corresponding techniques such as graph-based approaches, additional training tasks, and planning strategies, which typically overly rely on BART-based encoder-decoder models. Recent advances in training methods have led to substantial improvements in language-related challenges. However, challenges such as comprehension, factuality, and salience remain difficult and present significant research opportunities. We further investigate how these approaches are typically analyzed, covering the datasets for the subdomains of dialogue (e.g., meeting, customer service, and medical), the established automatic metrics (e.g., ROUGE), and common human evaluation approaches for assigning scores and evaluating annotator agreement. We observe that only a few datasets (i.e., SAMSum, AMI, DialogSum) are widely used. Despite its limitations, the ROUGE metric is the most commonly used, while human evaluation, considered the gold standard, is frequently reported without sufficient detail on the inter-annotator agreement and annotation guidelines. Additionally, we discuss the possible implications of the recently explored large language models and conclude that our described challenge taxonomy remains relevant despite a potential shift in relevance and difficulty.
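
Since the review singles out ROUGE as the most common automatic metric despite its limitations, a minimal usage sketch with Google's rouge-score package may be helpful; the example strings are illustrative.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the committee agreed to postpone the product launch",
    prediction="the committee decided to delay the launch",
)
print(scores["rougeL"].fmeasure)  # F1 over the longest common subsequence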