How Transformers Learn to Plan via Multi-Token Prediction. Huang, J., Zhou, Z., Xia, R., Mirzasoleiman, B., Su, W., & Huang, W. April, 2026. arXiv:2604.11912 [cs.LG]
While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
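The two objectives contrasted in the abstract can be illustrated with a small sketch of how training targets are typically built. This is not the paper's code: the function names `ntp_targets` and `mtp_targets`, the horizon parameter `k`, and the toy path are all assumptions made for illustration, and the paper's actual tokenization, prompt format, and loss are not given in this record. The sketch only shows the generic idea that NTP supervises one future token per position while MTP supervises several.

```python
# Illustrative sketch only: hypothetical helpers, not the authors' implementation.

def ntp_targets(seq):
    """Next-token prediction: each prefix is paired with the single next token."""
    return [(seq[:t + 1], seq[t + 1]) for t in range(len(seq) - 1)]


def mtp_targets(seq, k):
    """Multi-token prediction: each prefix is paired with up to k future tokens.

    Near the end of the sequence fewer than k tokens remain, so the
    target list is simply shorter there.
    """
    return [(seq[:t + 1], seq[t + 1:t + 1 + k]) for t in range(len(seq) - 1)]


# A toy path on a star graph: node 0 is the center, 5 -> 3 -> 7 is one arm.
# The node labels are made up for this example.
path = [0, 5, 3, 7]

ntp = ntp_targets(path)
mtp = mtp_targets(path, k=2)

for prefix, target in ntp:
    print("NTP", prefix, "->", target)
for prefix, targets in mtp:
    print("MTP", prefix, "->", targets)
```

With `k=2`, the first position is asked to predict both the first and the second node on the arm, so the training signal at that position already depends on which arm leads to the goal, rather than only on the immediate neighbor.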
@misc{huang_how_2026,
title = {How {Transformers} {Learn} to {Plan} via {Multi}-{Token} {Prediction}},
url = {http://arxiv.org/abs/2604.11912},
doi = {10.48550/arXiv.2604.11912},
abstract = {While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.},
language = {en},
urldate = {2026-05-28},
publisher = {arXiv},
author = {Huang, Jianhao and Zhou, Zhanpeng and Xia, Renqiu and Mirzasoleiman, Baharan and Su, Weijie and Huang, Wei},
month = apr,
year = {2026},
note = {arXiv:2604.11912 [cs.LG]},
keywords = {Computer Science - Artificial Intelligence, Computer Science - Machine Learning},
}