LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning. You, Z., Nie, S., Zhang, X., Hu, J., Zhou, J., Lu, Z., Wen, J.-R., & Li, C. May, 2025. arXiv:2505.16933 [cs]
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks, with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.
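The abstract's core architectural idea is a vision encoder whose patch features are mapped by an MLP connector into the language embedding space before being consumed by the diffusion language model. The following is a minimal sketch of that connector step only, using hypothetical module names and dimensions (not the released LLaDA-V code); the LLaDA-style masked-diffusion training objective, which randomly masks response tokens and trains the model to recover them, is omitted here.

```python
# Hedged sketch: project vision-encoder features into the language embedding
# space with a two-layer MLP connector, then prepend them to the text
# embeddings, as in standard visual instruction tuning pipelines.
# All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    """Maps vision features (vision_dim) to language embeddings (lm_dim)."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.proj(visual_feats)


# Hypothetical sizes for illustration only.
batch, num_patches, vision_dim, lm_dim = 2, 576, 1024, 4096
visual_feats = torch.randn(batch, num_patches, vision_dim)  # vision encoder output
text_embeds = torch.randn(batch, 128, lm_dim)               # language token embeddings

connector = MLPConnector(vision_dim, lm_dim)
visual_tokens = connector(visual_feats)                      # (2, 576, 4096)

# Multimodal sequence fed to the diffusion language model: image tokens
# followed by text tokens.
multimodal_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
print(multimodal_embeds.shape)  # torch.Size([2, 704, 4096])
```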
@misc{you_llada-v_2025,
	title = {{LLaDA}-{V}: {Large} {Language} {Diffusion} {Models} with {Visual} {Instruction} {Tuning}},
	shorttitle = {{LLaDA}-{V}},
	url = {http://arxiv.org/abs/2505.16933},
	doi = {10.48550/arXiv.2505.16933},
	abstract = {In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks, with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.},
	urldate = {2025-05-23},
	publisher = {arXiv},
	author = {You, Zebin and Nie, Shen and Zhang, Xiaolu and Hu, Jun and Zhou, Jun and Lu, Zhiwu and Wen, Ji-Rong and Li, Chongxuan},
	month = may,
	year = {2025},
	note = {arXiv:2505.16933 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning},
}
