SPaRC: A Spatial Pathfinding Reasoning Challenge.
Kaesberg, L. B.; Wahle, J. P.; Ruas, T.; and Gipp, B.
May 2025. arXiv:2505.16686 [cs].
@misc{kaesberg_sparc_2025,
  title = {{SPaRC}: {A} {Spatial} {Pathfinding} {Reasoning} {Challenge}},
  shorttitle = {{SPaRC}},
  url = {http://arxiv.org/abs/2505.16686},
  doi = {10.48550/arXiv.2505.16686},
  urldate = {2025-05-23},
  publisher = {arXiv},
  author = {Kaesberg, Lars Benedikt and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  month = may,
  year = {2025},
  note = {arXiv:2505.16686 [cs]},
  keywords = {!tr, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, agents\_reasoning, nlp\_agents},
}
Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (>50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models' spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.
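The abstract above notes that models frequently produce invalid paths. As a purely illustrative sketch (not part of SPaRC itself; function and parameter names are hypothetical), the following Python snippet shows the kind of basic validity check such a puzzle imposes before any arithmetic or geometric rules are even considered: the path must start and end at the given cells, stay on the grid, move one orthogonal step at a time, and never revisit a cell.

# Illustrative only: a minimal validity check for a 2D grid path, in the spirit of
# SPaRC's "invalid path" failure mode. The real benchmark adds arithmetic and
# geometric rule constraints that are not modeled here.
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col)

def is_valid_path(path: List[Cell], rows: int, cols: int,
                  start: Cell, goal: Cell) -> bool:
    """Check that a path starts at `start`, ends at `goal`, stays on the grid,
    moves one orthogonal step at a time, and never revisits a cell."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    if len(set(path)) != len(path):               # no revisited cells
        return False
    for (r, c) in path:
        if not (0 <= r < rows and 0 <= c < cols):  # off the grid
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:       # only orthogonal unit steps
            return False
    return True

# Example: a 3x3 grid, top-left to bottom-right.
print(is_valid_path([(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)], 3, 3, (0, 0), (2, 2)))  # True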
SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection.
Muhammad, S. H.; Ousidhoum, N.; Abdulmumin, I.; Yimam, S. M.; Wahle, J. P.; Ruas, T.; Beloucif, M.; Kock, C. D.; Belay, T. D.; Ahmad, I. S.; Surange, N.; Teodorescu, D.; Adelani, D. I.; Aji, A. F.; Ali, F.; Araujo, V.; Ayele, A. A.; Ignat, O.; Panchenko, A.; Zhou, Y.; and Mohammad, S. M.
March 2025. arXiv:2503.07269 [cs].
@misc{muhammad_semeval-2025_2025,
  title = {{SemEval}-2025 {Task} 11: {Bridging} the {Gap} in {Text}-{Based} {Emotion} {Detection}},
  shorttitle = {{SemEval}-2025 {Task} 11},
  url = {http://arxiv.org/abs/2503.07269},
  doi = {10.48550/arXiv.2503.07269},
  urldate = {2025-03-11},
  publisher = {arXiv},
  author = {Muhammad, Shamsuddeen Hassan and Ousidhoum, Nedjma and Abdulmumin, Idris and Yimam, Seid Muhie and Wahle, Jan Philip and Ruas, Terry and Beloucif, Meriem and Kock, Christine De and Belay, Tadesse Destaw and Ahmad, Ibrahim Said and Surange, Nirmal and Teodorescu, Daniela and Adelani, David Ifeoluwa and Aji, Alham Fikri and Ali, Felermino and Araujo, Vladimir and Ayele, Abinew Ali and Ignat, Oana and Panchenko, Alexander and Zhou, Yi and Mohammad, Saif M.},
  month = mar,
  year = {2025},
  note = {arXiv:2503.07269 [cs]},
  keywords = {!tr\_author, Computer Science - Computation and Language, semeval},
}
We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and spoken across various continents. The data instances are multi-labeled into six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) emotion labels in monolingual settings, (b) emotion intensity scores, and (c) emotion labels in cross-lingual settings. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, as well as findings on the best-performing systems, the most common approaches, and the most effective methods across various tracks and languages. The datasets for this task are publicly available.
Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator.
Kirstein, F. T.; Lima Ruas, T.; and Gipp, B.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; Schockaert, S.; Darwish, K.; and Agarwal, A., editors, Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 561–574, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.
@inproceedings{kirstein-etal-2025-meeting,
  address = {Abu Dhabi, UAE},
  title = {Is my {Meeting} {Summary} {Good}? {Estimating} {Quality} with a {Multi}-{LLM} {Evaluator}},
  url = {https://aclanthology.org/2025.coling-industry.48/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}: {Industry} {Track}},
  publisher = {Association for Computational Linguistics},
  author = {Kirstein, Frederic Thomas and Lima Ruas, Terry and Gipp, Bela},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven and Darwish, Kareem and Agarwal, Apoorv},
  month = jan,
  year = {2025},
  pages = {561--574},
}
The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.
Towards Human Understanding of Paraphrase Types in Large Language Models.
Meier, D.; Wahle, J. P.; Lima Ruas, T.; and Gipp, B.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; and Schockaert, S., editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 6298–6316, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.
@inproceedings{meier-etal-2025-towards,
  address = {Abu Dhabi, UAE},
  title = {Towards {Human} {Understanding} of {Paraphrase} {Types} in {Large} {Language} {Models}},
  url = {https://aclanthology.org/2025.coling-main.421/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Meier, Dominik and Wahle, Jan Philip and Lima Ruas, Terry and Gipp, Bela},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
  month = jan,
  year = {2025},
  pages = {6298--6316},
}
Paraphrases represent a human's intuitive ability to understand expressions presented in various different ways. Current paraphrase evaluations of language models primarily use binary approaches, offering limited interpretability of specific text changes. Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes and offer a granular view of the flexibility in linguistic expression (e.g., a shift in syntax or vocabulary used). In this study, we assess the human preferences towards ChatGPT in generating English paraphrases with ten APTs and five prompting techniques. We introduce APTY (Atomic Paraphrase TYpes), a dataset of 800 sentence-level and word-level annotations by 15 annotators. The dataset also provides a human preference ranking of paraphrases with different types that can be used to fine-tune models with RLHF and DPO methods. Our results reveal that ChatGPT and a DPO-trained LLama 7B model can generate simple APTs, such as additions and deletions, but struggle with complex structures (e.g., subordination changes). This study contributes to understanding which aspects of paraphrasing language models have already succeeded at understanding and what remains elusive. In addition, we show how our curated datasets can be used to develop language models with specific linguistic capabilities.
What's Wrong? Refining Meeting Summaries with LLM Feedback.
Kirstein, F. T.; Lima Ruas, T.; and Gipp, B.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; and Schockaert, S., editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 2100–2120, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.
@inproceedings{kirstein-etal-2025-whats,
  address = {Abu Dhabi, UAE},
  title = {What's {Wrong}? {Refining} {Meeting} {Summaries} with {LLM} {Feedback}},
  url = {https://aclanthology.org/2025.coling-main.143/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Kirstein, Frederic Thomas and Lima Ruas, Terry and Gipp, Bela},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
  month = jan,
  year = {2025},
  pages = {2100--2120},
}
Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.
Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields.
Wahle, J. P.; Lima Ruas, T.; Abdalla, M.; Gipp, B.; and Mohammad, S. M.
In Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B. D.; and Schockaert, S., editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 1027–1044, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.
@inproceedings{wahle-etal-2025-citation,
  address = {Abu Dhabi, UAE},
  title = {Citation {Amnesia}: {On} {The} {Recency} {Bias} of {NLP} and {Other} {Academic} {Fields}},
  url = {https://aclanthology.org/2025.coling-main.69/},
  booktitle = {Proceedings of the 31st {International} {Conference} on {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Wahle, Jan Philip and Lima Ruas, Terry and Abdalla, Mohamed and Gipp, Bela and Mohammad, Saif M.},
  editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
  month = jan,
  year = {2025},
  pages = {1027--1044},
}
This study examines the tendency to cite older work across 20 fields of study over 43 years (1980–2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to them over time or whether differences can be observed. Our analysis, based on a dataset of ~240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science). The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences), even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community's engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
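For readers unfamiliar with the quantity tracked in this paper, the following is a minimal sketch, assuming a toy representation of a paper's reference years, of how a mean citation age could be computed. It illustrates the concept only and is not the paper's actual data pipeline or methodology.

# Illustrative only: "citation age" is, roughly, how far back in time a paper's
# references reach. This toy helper is a hypothetical sketch of that basic quantity.
from statistics import mean

def mean_citation_age(citing_year: int, cited_years: list[int]) -> float:
    """Average age, in years, of the works a paper cites."""
    return float(mean(citing_year - y for y in cited_years))

# A 2023 paper citing works from 2021, 2019, and 2008 has a mean citation age of 7 years.
print(mean_citation_age(2023, [2021, 2019, 2008]))  # 7.0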
Stay Focused: Problem Drift in Multi-Agent Debate.
Becker, J.; Kaesberg, L. B.; Stephan, A.; Wahle, J. P.; Ruas, T.; and Gipp, B.
February 2025. arXiv:2502.19559 [cs].
@misc{becker_stay_2025,
  title = {Stay {Focused}: {Problem} {Drift} in {Multi}-{Agent} {Debate}},
  shorttitle = {Stay {Focused}},
  url = {http://arxiv.org/abs/2502.19559},
  doi = {10.48550/arXiv.2502.19559},
  urldate = {2025-02-28},
  publisher = {arXiv},
  author = {Becker, Jonas and Kaesberg, Lars Benedikt and Stephan, Andreas and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  month = feb,
  year = {2025},
  note = {arXiv:2502.19559 [cs]},
  keywords = {!tr, Computer Science - Computation and Language, nlp\_llm, nlp\_multiagent},
}
Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations, particularly when scaling them to longer reasoning chains. In this study, we unveil a new issue of multi-agent debate: discussions drift away from the initial problem over multiple turns. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, we perform a human study with eight experts on discussions suffering from problem drift, who find the most common issues are a lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To systematically address the issue of problem drift, we propose DRIFTJudge, a method based on LLM-as-a-judge, to detect problem drift at test-time. We further propose DRIFTPolicy, a method to mitigate 31% of problem drift cases. Our study can be seen as a first step to understanding a key limitation of multi-agent debate, highlighting pathways for improving their effectiveness in the future.
Voting or Consensus? Decision-Making in Multi-Agent Debate.
Kaesberg, L. B.; Becker, J.; Wahle, J. P.; Ruas, T.; and Gipp, B.
February 2025. arXiv:2502.19130 [cs].
@misc{kaesberg_voting_2025,
  title = {Voting or {Consensus}? {Decision}-{Making} in {Multi}-{Agent} {Debate}},
  shorttitle = {Voting or {Consensus}?},
  url = {http://arxiv.org/abs/2502.19130},
  doi = {10.48550/arXiv.2502.19130},
  urldate = {2025-02-27},
  publisher = {arXiv},
  author = {Kaesberg, Lars Benedikt and Becker, Jonas and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  month = feb,
  year = {2025},
  note = {arXiv:2502.19130 [cs]},
  keywords = {!tr, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, nlp\_llm, nlp\_multiagent},
}
Much of the success of multi-agent debates depends on carefully choosing the right parameters. Among them, the decision-making protocol stands out. Systematic comparison of decision protocols is difficult because studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making addresses the challenges of different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time (i.e., decision protocol) to analyze how different methods affect the collaboration between agents and test different protocols on knowledge (MMLU, MMLU-Pro, GPQA) and reasoning datasets (StrategyQA, MuSR, SQuAD 2.0). Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks over the other decision protocol. Increasing the number of agents improves performance, while more discussion rounds before voting reduces it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
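To make the compared decision protocols concrete, here is a minimal, hypothetical Python sketch of two of them, majority voting and unanimity consensus, applied to answers already collected from agents. The debate itself, the prompting, and the paper's AAD and CI methods are not modeled here.

# Illustrative only: two decision protocols in their simplest form, operating on
# final answer strings gathered from the debating agents.
from collections import Counter
from typing import Optional

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer (ties resolved by first occurrence)."""
    counts = Counter(answers)
    return max(counts, key=counts.get)

def unanimity_consensus(answers: list[str]) -> Optional[str]:
    """Return the answer only if every agent agrees; otherwise signal no decision
    (in practice, this would trigger another discussion round)."""
    return answers[0] if len(set(answers)) == 1 else None

answers = ["B", "A", "B"]
print(majority_vote(answers))        # "B"
print(unanimity_consensus(answers))  # None -> keep debating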
You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations.
Kirstein, F.; Khan, M.; Wahle, J. P.; Ruas, T.; and Gipp, B.
February 2025. arXiv:2502.13001 [cs].
@misc{kirstein_you_2025,
  title = {You need to {MIMIC} to get {FAME}: {Solving} {Meeting} {Transcript} {Scarcity} with a {Multi}-{Agent} {Conversations}},
  shorttitle = {You need to {MIMIC} to get {FAME}},
  url = {http://arxiv.org/abs/2502.13001},
  doi = {10.48550/arXiv.2502.13001},
  urldate = {2025-02-27},
  publisher = {arXiv},
  author = {Kirstein, Frederic and Khan, Muneeb and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela},
  month = feb,
  year = {2025},
  note = {arXiv:2502.13001 [cs]},
  keywords = {!tr, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, nlp\_dataset, nlp\_meeting\_sum},
}
Meeting summarization suffers from limited high-quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi-agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. A modular post-processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real-meeting spontaneity (4.5/5 in naturalness), preserves speaker-centric challenges (3/5 in spoken language), and introduces richer information-oriented difficulty (4/5 in difficulty). These findings highlight that FAME is a good and scalable proxy for real-world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation-centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages.
Muhammad, S. H.; Ousidhoum, N.; Abdulmumin, I.; Wahle, J. P.; Ruas, T.; Beloucif, M.; Kock, C. d.; Surange, N.; Teodorescu, D.; Ahmad, I. S.; Adelani, D. I.; Aji, A. F.; Ali, F. D. M. A.; Alimova, I.; Araujo, V.; Babakov, N.; Baes, N.; Bucur, A.; Bukula, A.; Cao, G.; Cardenas, R. T.; Chevi, R.; Chukwuneke, C. I.; Ciobotaru, A.; Dementieva, D.; Gadanya, M. S.; Geislinger, R.; Gipp, B.; Hourrane, O.; Ignat, O.; Lawan, F. I.; Mabuya, R.; Mahendra, R.; Marivate, V.; Piper, A.; Panchenko, A.; Ferreira, C. H. P.; Protasov, V.; Rutunda, S.; Shrivastava, M.; Udrea, A. C.; Wanzare, L. D. A.; Wu, S.; Wunderlich, F. V.; Zhafran, H. M.; Zhang, T.; Zhou, Y.; and Mohammad, S. M.
February 2025. arXiv:2502.11926 [cs].
@misc{muhammad_brighter_2025,
  title = {{BRIGHTER}: {BRIdging} the {Gap} in {Human}-{Annotated} {Textual} {Emotion} {Recognition} {Datasets} for 28 {Languages}},
  shorttitle = {{BRIGHTER}},
  url = {http://arxiv.org/abs/2502.11926},
  doi = {10.48550/arXiv.2502.11926},
  urldate = {2025-02-18},
  publisher = {arXiv},
  author = {Muhammad, Shamsuddeen Hassan and Ousidhoum, Nedjma and Abdulmumin, Idris and Wahle, Jan Philip and Ruas, Terry and Beloucif, Meriem and Kock, Christine de and Surange, Nirmal and Teodorescu, Daniela and Ahmad, Ibrahim Said and Adelani, David Ifeoluwa and Aji, Alham Fikri and Ali, Felermino D. M. A. and Alimova, Ilseyar and Araujo, Vladimir and Babakov, Nikolay and Baes, Naomi and Bucur, Ana-Maria and Bukula, Andiswa and Cao, Guanqun and Cardenas, Rodrigo Tufino and Chevi, Rendi and Chukwuneke, Chiamaka Ijeoma and Ciobotaru, Alexandra and Dementieva, Daryna and Gadanya, Murja Sani and Geislinger, Robert and Gipp, Bela and Hourrane, Oumaima and Ignat, Oana and Lawan, Falalu Ibrahim and Mabuya, Rooweither and Mahendra, Rahmad and Marivate, Vukosi and Piper, Andrew and Panchenko, Alexander and Ferreira, Charles Henrique Porto and Protasov, Vitaly and Rutunda, Samuel and Shrivastava, Manish and Udrea, Aura Cristina and Wanzare, Lilian Diana Awuor and Wu, Sophie and Wunderlich, Florian Valentin and Zhafran, Hanif Muhammad and Zhang, Tianhui and Zhou, Yi and Mohammad, Saif M.},
  month = feb,
  year = {2025},
  note = {arXiv:2502.11926 [cs]},
  keywords = {!tr\_author, Computer Science - Computation and Language, nlp\_dataset, nlp\_semeval},
}
People worldwide use language in subtle and complex ways to express emotions. While emotion recognition – an umbrella term for several NLP tasks – significantly impacts different applications in NLP and other fields, most work in the area is focused on high-resource languages. Therefore, this has led to major disparities in research and proposed solutions, especially for low-resource languages that suffer from the lack of high-quality datasets. In this paper, we present BRIGHTER – a collection of multilabeled emotion-annotated datasets in 28 different languages. BRIGHTER covers predominantly low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances from various domains annotated by fluent speakers. We describe the data collection and annotation processes and the challenges of building these datasets. Then, we report different experimental results for monolingual and crosslingual multi-label emotion identification, as well as intensity-level emotion recognition. We investigate results with and without using LLMs and analyse the large variability in performance across languages and text domains. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition and discuss their impact and utility.
CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization.
Kirstein, F.; Wahle, J. P.; Gipp, B.; and Ruas, T.
Journal of Artificial Intelligence Research, 82: 313–365. January 2025.
@article{kirstein_cads_2025,
  title = {{CADS}: {A} {Systematic} {Literature} {Review} on the {Challenges} of {Abstractive} {Dialogue} {Summarization}},
  volume = {82},
  issn = {1076-9757},
  shorttitle = {{CADS}},
  url = {http://jair.org/index.php/jair/article/view/16674},
  doi = {10.1613/jair.1.16674},
  urldate = {2025-01-29},
  journal = {Journal of Artificial Intelligence Research},
  author = {Kirstein, Frederic and Wahle, Jan Philip and Gipp, Bela and Ruas, Terry},
  month = jan,
  year = {2025},
  pages = {313--365},
}
Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although focused reviews have been conducted on this topic, there is a lack of comprehensive work that details the core challenges of dialogue summarization, unifies the differing understanding of the task, and aligns proposed techniques, datasets, and evaluation metrics with the challenges. This article summarizes the research on Transformer-based abstractive summarization for English dialogues by systematically reviewing 1262 unique research papers published between 2019 and 2024, relying on the Semantic Scholar and DBLP databases. We cover the main challenges present in dialog summarization (i.e., language, structure, comprehension, speaker, salience, and factuality) and link them to corresponding techniques such as graph-based approaches, additional training tasks, and planning strategies, which typically overly rely on BART-based encoder-decoder models. Recent advances in training methods have led to substantial improvements in language-related challenges. However, challenges such as comprehension, factuality, and salience remain difficult and present significant research opportunities. We further investigate how these approaches are typically analyzed, covering the datasets for the subdomains of dialogue (e.g., meeting, customer service, and medical), the established automatic metrics (e.g., ROUGE), and common human evaluation approaches for assigning scores and evaluating annotator agreement. We observe that only a few datasets (i.e., SAMSum, AMI, DialogSum) are widely used. Despite its limitations, the ROUGE metric is the most commonly used, while human evaluation, considered the gold standard, is frequently reported without sufficient detail on the inter-annotator agreement and annotation guidelines. Additionally, we discuss the possible implications of the recently explored large language models and conclude that our described challenge taxonomy remains relevant despite a potential shift in relevance and difficulty.