Why Do Large Language Models (LLMs) Struggle to Count Letters? Fu, T., Ferrando, R., Conde, J., Arriaga, C., & Reviriego, P. December 2024. arXiv:2412.18626 [cs]
Large Language Models (LLMs) have achieved unprecedented performance on many complex tasks, being able, for example, to answer questions on almost any topic. However, they struggle with simpler tasks, such as counting the occurrences of letters in a word, as illustrated by the inability of many LLMs to count the number of "r" letters in "strawberry". Several works have studied this problem and linked it to the tokenization used by LLMs, to intrinsic limitations of the attention mechanism, or to the lack of character-level training data. In this paper, we conduct an experimental study to evaluate the relation between LLM letter-counting errors and 1) the frequency of the word and its components in the training dataset and 2) the complexity of the counting operation. We present a comprehensive analysis of LLM errors when counting letter occurrences by evaluating a representative group of models over a large number of words. The results show a number of consistent trends across the models evaluated: 1) models are capable of recognizing the letters but not of counting them; 2) the frequency of the word, and of the tokens in the word, does not have a significant impact on the errors; 3) letter frequency is positively correlated with errors: more frequent letters tend to have more counting errors; 4) errors are strongly correlated with the number of letters or tokens in a word; and 5) the strongest correlation occurs with the number of letters whose counts are larger than one, with most models unable to correctly count words in which a letter appears more than twice.
These results suggest that the letter-counting failures of LLMs are not related to the frequency of words or tokens in the training data but to the complexity of the counting operation. However, further studies are needed to build a better understanding of the limitations of LLMs when counting the letters in a word.
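The abstract's central example can be made concrete with a short sketch. The token segmentation below is hypothetical and purely illustrative of how subword tokenization scatters a letter across tokens; the paper evaluates LLM responses, not Python code:

```python
def count_letter(word: str, letter: str) -> int:
    """Ground-truth, case-insensitive count of a letter in a word --
    the task the paper asks LLMs to perform."""
    return word.lower().count(letter.lower())

# The canonical failure case discussed in the abstract:
assert count_letter("strawberry", "r") == 3

# A crude illustration of why tokenization can obscure letters:
# with a subword split, no single token contains all three "r"s,
# so the count must be aggregated across tokens.
tokens = ["str", "aw", "berry"]  # hypothetical BPE-style segmentation
per_token = [t.count("r") for t in tokens]
print(per_token, sum(per_token))  # [1, 0, 2] 3
```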
@misc{fu_why_2024,
	title = {Why {Do} {Large} {Language} {Models} ({LLMs}) {Struggle} to {Count} {Letters}?},
	url = {http://arxiv.org/abs/2412.18626},
	doi = {10.48550/arXiv.2412.18626},
	abstract = {Large Language Models (LLMs) have achieved unprecedented performance on many complex tasks, being able, for example, to answer questions on almost any topic. However, they struggle with simpler tasks, such as counting the occurrences of letters in a word, as illustrated by the inability of many LLMs to count the number of "r" letters in "strawberry". Several works have studied this problem and linked it to the tokenization used by LLMs, to intrinsic limitations of the attention mechanism, or to the lack of character-level training data. In this paper, we conduct an experimental study to evaluate the relation between LLM letter-counting errors and 1) the frequency of the word and its components in the training dataset and 2) the complexity of the counting operation. We present a comprehensive analysis of LLM errors when counting letter occurrences by evaluating a representative group of models over a large number of words. The results show a number of consistent trends across the models evaluated: 1) models are capable of recognizing the letters but not of counting them; 2) the frequency of the word, and of the tokens in the word, does not have a significant impact on the errors; 3) letter frequency is positively correlated with errors: more frequent letters tend to have more counting errors; 4) errors are strongly correlated with the number of letters or tokens in a word; and 5) the strongest correlation occurs with the number of letters whose counts are larger than one, with most models unable to correctly count words in which a letter appears more than twice. These results suggest that the letter-counting failures of LLMs are not related to the frequency of words or tokens in the training data but to the complexity of the counting operation. However, further studies are needed to build a better understanding of the limitations of LLMs when counting the letters in a word.},
	language = {en},
	urldate = {2025-12-28},
	publisher = {arXiv},
	author = {Fu, Tairan and Ferrando, Raquel and Conde, Javier and Arriaga, Carlos and Reviriego, Pedro},
	month = dec,
	year = {2024},
	note = {arXiv:2412.18626 [cs]},
	keywords = {Computer Science - Computation and Language},
}
