Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange. Satpute, A., Gießing, N., Greiner-Petter, A., Schubotz, M., Teschke, O., Aizawa, A., & Gipp, B. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2316–2320, Washington, DC, USA, June 2024. ACM. Core Rank A*
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answering benchmarks, to generate answers to 78 questions from Math Stack Exchange (MSE). Second, we conduct a case analysis of the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We find that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ARQMath-3 Task 1 in terms of P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research at https://github.com/gipplab/LLM-Investig-MathStackExchange.
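As a point of reference for the metrics quoted above, the following minimal Python sketch shows how P@10 and nDCG@10 are conventionally computed for a ranked list of retrieved answers; the relevance grades in the example are illustrative placeholders, not data from the paper.

import math

def precision_at_k(grades, k=10):
    # Fraction of the top-k results judged relevant (grade > 0).
    return sum(1 for g in grades[:k] if g > 0) / k

def ndcg_at_k(grades, k=10):
    # Normalized discounted cumulative gain over the top-k results.
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance (0-3) of ten ranked answers to one MSE question.
grades = [3, 0, 2, 0, 0, 1, 0, 0, 0, 0]
print(precision_at_k(grades))  # 0.3
print(ndcg_at_k(grades))       # about 0.91 for this toy ranking

ARQMath-style evaluation assigns graded relevance judgments (0-3) to retrieved answers, which is why a graded measure such as nDCG is reported alongside the binary P@10.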
@inproceedings{BibbaseSatputeGGS24a,
	address = {Washington, DC, USA},
	title = {Can {LLMs} {Master} {Math}? {Investigating} {Large} {Language} {Models} on {Math} {Stack} {Exchange}},
	shorttitle = {Can {LLMs} {Master} {Math}?},
	url = {https://arxiv.org/abs/2404.00344},
	doi = {10.1145/3626772.3657945},
	abstract = {Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answering benchmarks, to generate answers to 78 questions from Math Stack Exchange (MSE). Second, we conduct a case analysis of the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We find that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ARQMath-3 Task 1 in terms of P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research at https://github.com/gipplab/LLM-Investig-MathStackExchange.},
	booktitle = {Proceedings of the 47th {International} {ACM} {SIGIR} {Conference} on {Research} and {Development} in {Information} {Retrieval}},
	publisher = {ACM},
	author = {Satpute, Ankit and Gießing, Noah and Greiner-Petter, André and Schubotz, Moritz and Teschke, Olaf and Aizawa, Akiko and Gipp, Bela},
	month = jun,
	year = {2024},
	note = {Core Rank A*},
	pages = {2316--2320},
}
