How Transformers Learn to Plan via Multi-Token Prediction. Huang, J., Zhou, Z., Xia, R., Mirzasoleiman, B., Su, W., & Huang, W. April, 2026. arXiv:2604.11912 [cs.LG]
While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
@misc{huang_how_2026,
	title = {How {Transformers} {Learn} to {Plan} via {Multi}-{Token} {Prediction}},
	url = {http://arxiv.org/abs/2604.11912},
	doi = {10.48550/arXiv.2604.11912},
	abstract = {While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.},
	language = {en},
	urldate = {2026-05-28},
	publisher = {arXiv},
	author = {Huang, Jianhao and Zhou, Zhanpeng and Xia, Renqiu and Mirzasoleiman, Baharan and Su, Weijie and Huang, Wei},
	month = apr,
	year = {2026},
	note = {arXiv:2604.11912 [cs.LG]},
	keywords = {Computer Science - Artificial Intelligence, Computer Science - Machine Learning},
}
