MiniMax-01: Scaling Foundation Models with Lightning Attention

MiniMax-01: Scaling Foundation Models with Lightning Attention. MiniMax, Li, A., Gong, B., Yang, B., Shan, B., Liu, C., Zhu, C., Zhang, C., Guo, C., Chen, D., Li, D., Jiao, E., Li, G., Zhang, G., Sun, H., Dong, H., Zhu, J., Zhuang, J., Song, J., Zhu, J., Han, J., Li, J., Xie, J., Xu, J., Yan, J., Zhang, K., Xiao, K., Kang, K., Han, L., Wang, L., Yu, L., Feng, L., Zheng, L., Chai, L., Xing, L., Ju, M., Chi, M., Zhang, M., Huang, P., Niu, P., Li, P., Zhao, P., Yang, Q., Xu, Q., Wang, Q., Wang, Q., Li, Q., Leng, R., Shi, S., Yu, S., Li, S., Zhu, S., Huang, T., Liang, T., Sun, W., Sun, W., Cheng, W., Li, W., Song, X., Su, X., Han, X., Zhang, X., Hou, X., Min, X., Zou, X., Shen, X., Gong, Y., Zhu, Y., Zhou, Y., Zhong, Y., Hu, Y., Fan, Y., Yu, Y., Yang, Y., Li, Y., Huang, Y., Li, Y., Huang, Y., Xu, Y., Mao, Y., Li, Z., Li, Z., Tao, Z., Ying, Z., Cong, Z., Qin, Z., Fan, Z., Yu, Z., Jiang, Z., & Wu, Z. January, 2025. arXiv:2501.08313 [cs]

We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.

@misc{minimax_minimax-01_2025,
	title = {{MiniMax}-01: {Scaling} {Foundation} {Models} with {Lightning} {Attention}},
	shorttitle = {{MiniMax}-01},
	url = {http://arxiv.org/abs/2501.08313},
	doi = {10.48550/arXiv.2501.08313},
	abstract = {We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.},
	urldate = {2025-01-21},
	publisher = {arXiv},
	author = {MiniMax and Li, Aonian and Gong, Bangwei and Yang, Bo and Shan, Boji and Liu, Chang and Zhu, Cheng and Zhang, Chunhao and Guo, Congchao and Chen, Da and Li, Dong and Jiao, Enwei and Li, Gengxin and Zhang, Guojun and Sun, Haohai and Dong, Houze and Zhu, Jiadai and Zhuang, Jiaqi and Song, Jiayuan and Zhu, Jin and Han, Jingtao and Li, Jingyang and Xie, Junbin and Xu, Junhao and Yan, Junjie and Zhang, Kaishun and Xiao, Kecheng and Kang, Kexi and Han, Le and Wang, Leyang and Yu, Lianfei and Feng, Liheng and Zheng, Lin and Chai, Linbo and Xing, Long and Ju, Meizhi and Chi, Mingyuan and Zhang, Mozhi and Huang, Peikai and Niu, Pengcheng and Li, Pengfei and Zhao, Pengyu and Yang, Qi and Xu, Qidi and Wang, Qiexiang and Wang, Qin and Li, Qiuhui and Leng, Ruitao and Shi, Shengmin and Yu, Shuqi and Li, Sichen and Zhu, Songquan and Huang, Tao and Liang, Tianrun and Sun, Weigao and Sun, Weixuan and Cheng, Weiyu and Li, Wenkai and Song, Xiangjun and Su, Xiao and Han, Xiaodong and Zhang, Xinjie and Hou, Xinzhu and Min, Xu and Zou, Xun and Shen, Xuyang and Gong, Yan and Zhu, Yingjie and Zhou, Yipeng and Zhong, Yiran and Hu, Yongyi and Fan, Yuanxiang and Yu, Yue and Yang, Yufeng and Li, Yuhao and Huang, Yunan and Li, Yunji and Huang, Yunpeng and Xu, Yunzhi and Mao, Yuxin and Li, Zehan and Li, Zekang and Tao, Zewei and Ying, Zewen and Cong, Zhaoyang and Qin, Zhen and Fan, Zhenhua and Yu, Zhihang and Jiang, Zhuo and Wu, Zijia},
	month = jan,
	year = {2025},
	note = {arXiv:2501.08313 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition},
}
