How Transformers Learn to Plan via Multi-Token Prediction

How Transformers Learn to Plan via Multi-Token Prediction. Huang, J., Zhou, Z., Xia, R., Mirzasoleiman, B., Su, W., & Huang, W. April, 2026. arXiv:2604.11912 [cs.LG]

While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
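As a rough illustration of the difference between the two objectives named in the abstract, the sketch below builds the supervision targets each one would use on a single token sequence. It is a generic sketch, not the paper's formulation: the function names `ntp_targets` and `mtp_targets`, the horizon parameter `k`, and the toy path are all assumptions introduced here, and the abstract does not state how the authors define or weight the multi-token loss.

```python
from typing import List, Tuple


def ntp_targets(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Pair each non-empty proper prefix with the single token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mtp_targets(tokens: List[int], k: int) -> List[Tuple[List[int], List[int]]]:
    """Pair each non-empty proper prefix with up to k following tokens.

    k is a hypothetical prediction horizon; targets near the end of the
    sequence are shorter because fewer future tokens remain.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [(tokens[:i], tokens[i:i + k]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    # Toy path through a graph, written as node ids (hypothetical data).
    path = [0, 1, 2, 3]
    for prefix, target in ntp_targets(path):
        print("NTP", prefix, "->", target)
    for prefix, target in mtp_targets(path, k=2):
        print("MTP", prefix, "->", target)
```

With `k = 1` the multi-token targets reduce to the next-token ones (each wrapped in a one-element list), which is one way to see MTP as a generalization of NTP in this simplified setting.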

@misc{huang_how_2026,
	title = {How {Transformers} {Learn} to {Plan} via {Multi}-{Token} {Prediction}},
	url = {http://arxiv.org/abs/2604.11912},
	doi = {10.48550/arXiv.2604.11912},
	abstract = {While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.},
	language = {en},
	urldate = {2026-05-28},
	publisher = {arXiv},
	author = {Huang, Jianhao and Zhou, Zhanpeng and Xia, Renqiu and Mirzasoleiman, Baharan and Su, Weijie and Huang, Wei},
	month = apr,
	year = {2026},
	note = {arXiv:2604.11912 [cs.LG]},
	keywords = {Computer Science - Artificial Intelligence, Computer Science - Machine Learning, WG: Explorable},
}
