Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange. Satpute, A., Gießing, N., Greiner-Petter, A., Schubotz, M., Teschke, O., Aizawa, A., & Gipp, B. In 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2316–2320, Washington DC, USA, June, 2024. ACM. Core Rank A*
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1 in terms of P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research at https://github.com/gipplab/LLM-Investig-MathStackExchange.
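For context on the two metrics reported above, the sketch below is a minimal illustration of P@10 and nDCG@10 as standardly defined in IR evaluation. It is not the paper's evaluation code (the authors' code is at the GitHub link above), and the graded relevance judgments in the example are hypothetical, not data from the paper.

import math

def precision_at_k(rels, k=10):
    # Fraction of the top-k results judged relevant (grade > 0).
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg_at_k(rels, k=10):
    # Discounted cumulative gain: graded relevance discounted
    # by log2 of (rank + 1), with ranks starting at 1.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # DCG normalized by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for one ranked answer list
# (0 = not relevant, 1-3 = increasingly relevant).
rels = [3, 0, 2, 1, 0, 0, 2, 0, 0, 1]
print(precision_at_k(rels))         # 0.5 (five of the top ten are relevant)
print(round(ndcg_at_k(rels), 3))    # ~0.886 for this example ranking

Read this way, the reported P@10 of 0.37 means that, averaged over queries, roughly 3.7 of the top ten returned answers were judged relevant.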
@inproceedings{BibbaseSatputeGGS24a,
address = {Washington DC, USA},
title = {Can {LLMs} {Master} {Math}? {Investigating} {Large} {Language} {Models} on {Math} {Stack} {Exchange}},
shorttitle = {Can {LLMs} {Master} {Math}?},
url = {https://arxiv.org/abs/2404.00344},
doi = {10.1145/3626772.3657945},
abstract = {Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1 in terms of P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research at https://github.com/gipplab/LLM-Investig-MathStackExchange.},
booktitle = {47th {International} {ACM} {SIGIR} {Conference} on {Research} and {Development} in {Information} {Retrieval}},
publisher = {ACM},
author = {Satpute, Ankit and Gießing, Noah and Greiner-Petter, André and Schubotz, Moritz and Teschke, Olaf and Aizawa, Akiko and Gipp, Bela},
month = jun,
year = {2024},
note = {Core Rank A*},
pages = {2316--2320},
}