Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models. Bavaresco, A., de Heer Kloots, M., Pezzelle, S., & Fernández, R. arXiv preprint, 2025.
Abstract: Text representations from language models have proven remarkably predictive of human neural activity involved in language processing, with the recent transformer-based models outperforming previous architectures in downstream tasks and prediction of brain responses. However, the word representations learnt by language-only models may be limited in that they lack sensory information from other modalities, which several cognitive and neuroscience studies showed to be reflected in human meaning representations. Here, we leverage current pre-trained vision-language models (VLMs) to investigate whether the integration of visuo-linguistic information they operate leads to representations that are more aligned with human brain activity than those obtained by models trained with language-only input. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or a picture. Our results reveal that VLM representations correlate more strongly than those by language-only models with activations in brain areas functionally related to language processing. Additionally, we find that transformer-based vision-language encoders – e.g., LXMERT and VisualBERT – yield more brain-aligned representations than generative VLMs, whose autoregressive abilities do not seem to provide an advantage when modelling single words. Finally, our ablation analyses suggest that the high brain alignment achieved by some of the VLMs we evaluate results from semantic information acquired specifically during multimodal pretraining as opposed to being already encoded in their unimodal modules. Altogether, our findings indicate an advantage of multimodal models in predicting human brain activations, which reveals that modelling language and vision integration has the potential to capture the multimodal nature of human concept representations.
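The abstract speaks of model representations "correlating" with fMRI activations. One standard way such model–brain alignment is quantified is representational similarity analysis (RSA): compare the pairwise-dissimilarity structure of model representations with that of brain responses over the same set of concept words. The sketch below illustrates that general technique only; it is not the paper's actual pipeline, and all data here are random stand-ins (the array shapes, variable names, and the choice of correlation distance are assumptions for illustration).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical stand-in data: one representation per concept word from a
# model, and fMRI voxel responses for the same words (random, for illustration).
rng = np.random.default_rng(0)
n_words, model_dim, n_voxels = 50, 768, 200
model_reprs = rng.standard_normal((n_words, model_dim))
brain_reprs = rng.standard_normal((n_words, n_voxels))

def rsa_alignment(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation between the two representational
    dissimilarity matrices (condensed correlation-distance RDMs)."""
    rdm_a = pdist(a, metric="correlation")  # upper triangle, length n*(n-1)/2
    rdm_b = pdist(b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return float(rho)

score = rsa_alignment(model_reprs, brain_reprs)
print(f"RSA alignment: {score:.3f}")
```

Under this scheme, a model whose representational geometry better matches the brain's (e.g., a VLM vs. a language-only model, per the paper's findings) would yield a higher Spearman rho; comparing the RDMs rather than the raw features sidesteps the mismatch between model dimensionality and voxel count.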
@article{Bavaresco-deHeerKloots-etal-arxiv-2025,
title={Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models},
author={Anna Bavaresco and Marianne de Heer Kloots and Sandro Pezzelle and Raquel Fern\'{a}ndez},
journal={arXiv preprint},
year={2025},
url={https://arxiv.org/abs/2407.17914},
abstract={Text representations from language models have proven remarkably predictive of human neural activity involved in language processing,
with the recent transformer-based models outperforming previous architectures in downstream tasks and prediction of brain responses.
However, the word representations learnt by language-only models may be limited in that they lack sensory information from other modalities,
which several cognitive and neuroscience studies showed to be reflected in human meaning representations. Here, we leverage current pre-trained
vision-language models (VLMs) to investigate whether the integration of visuo-linguistic information they operate leads to representations
that are more aligned with human brain activity than those obtained by models trained with language-only input. We focus on fMRI responses
recorded while participants read concept words in the context of either a full sentence or a picture. Our results reveal that VLM representations
correlate more strongly than those by language-only models with activations in brain areas functionally related to language processing.
Additionally, we find that transformer-based vision-language encoders -- e.g., LXMERT and VisualBERT -- yield more brain-aligned representations
than generative VLMs, whose autoregressive abilities do not seem to provide an advantage when modelling single words. Finally, our ablation
analyses suggest that the high brain alignment achieved by some of the VLMs we evaluate results from semantic information acquired specifically
during multimodal pretraining as opposed to being already encoded in their unimodal modules. Altogether, our findings indicate an advantage of
multimodal models in predicting human brain activations, which reveals that modelling language and vision integration has the potential to
capture the multimodal nature of human concept representations.}
}