SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model. Allal, L. B., Lozhkov, A., Bakouch, E., Blázquez, G. M., Penedo, G., Tunstall, L., Marafioti, A., Kydlíček, H., Lajarín, A. P., Srivastav, V., Lochner, J., Fahlgren, C., Nguyen, X.-S., Fourrier, C., Burtenshaw, B., Larcher, H., Zhao, H., Zakka, C., Morlon, M., Raffel, C., von Werra, L., & Wolf, T. February 2025. arXiv:2502.02737 [cs]
While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.
@misc{allal_smollm2_2025,
	title = {{SmolLM2}: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
	shorttitle = {{SmolLM2}},
	url = {http://arxiv.org/abs/2502.02737},
	doi = {10.48550/arXiv.2502.02737},
	abstract = {While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on {\textasciitilde}11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.},
	urldate = {2025-02-11},
	publisher = {arXiv},
	author = {Allal, Loubna Ben and Lozhkov, Anton and Bakouch, Elie and Blázquez, Gabriel Martín and Penedo, Guilherme and Tunstall, Lewis and Marafioti, Andrés and Kydlíček, Hynek and Lajarín, Agustín Piqueres and Srivastav, Vaibhav and Lochner, Joshua and Fahlgren, Caleb and Nguyen, Xuan-Son and Fourrier, Clémentine and Burtenshaw, Ben and Larcher, Hugo and Zhao, Haojun and Zakka, Cyril and Morlon, Mathieu and Raffel, Colin and Werra, Leandro von and Wolf, Thomas},
	month = feb,
	year = {2025},
	note = {arXiv:2502.02737 [cs]},
	keywords = {Computer Science - Computation and Language},
}