SkillFactory: Self-Distillation For Learning Cognitive Behaviors. Sprague, Z., Lu, J., Wadhwa, M., Keh, S., Ren, M., & Durrett, G. December, 2025. arXiv:2512.04072 [cs]
Abstract: Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
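The self-distillation recipe the abstract describes can be sketched in miniature: samples from the model itself are rearranged into the surface format of a cognitive skill (here, verification followed by a retry via an alternate method), with no stronger teacher involved. This is a minimal illustration only; the function name, trace template, and skill markers below are assumptions, not the authors' actual implementation.

```python
def build_silver_trace(question, attempts, final_answer):
    """Splice the model's own sampled attempts into one "silver" SFT
    trace that exhibits verification and backtracking, ending on the
    attempt that produced `final_answer`. (Hypothetical sketch.)"""
    parts = [f"Question: {question}"]
    for i, attempt in enumerate(attempts[:-1]):
        parts.append(f"Attempt {i + 1}: {attempt}")
        # The verification/backtracking markers are inserted mechanically,
        # which is why such traces are "silver": the checks may be imperfect.
        parts.append("Wait, let me verify this answer... it may be wrong.")
        parts.append("Let me try a different approach.")
    parts.append(f"Attempt {len(attempts)}: {attempts[-1]}")
    parts.append(f"The answer is {final_answer}.")
    return "\n".join(parts)

# Two self-samples for the same question, rearranged into one trace:
trace = build_silver_trace(
    "What is 17 * 24?",
    ["17 * 24 = 398", "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408"],
    "408",
)
```

Traces built this way would then serve as SFT data to prime the skill format before RL, per the abstract's description.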
@misc{sprague_skillfactory_2025,
title = {{SkillFactory}: {Self}-{Distillation} {For} {Learning} {Cognitive} {Behaviors}},
shorttitle = {{SkillFactory}},
url = {http://arxiv.org/abs/2512.04072},
doi = {10.48550/arXiv.2512.04072},
abstract = {Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren’t exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These “silver” SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.},
language = {en},
urldate = {2025-12-18},
publisher = {arXiv},
author = {Sprague, Zayne and Lu, Jack and Wadhwa, Manya and Keh, Sedrick and Ren, Mengye and Durrett, Greg},
month = dec,
year = {2025},
note = {arXiv:2512.04072 [cs]},
keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Explorable},
}