Entity-Aware Cross-Modal Pretraining for Knowledge-based Visual Question Answering. Adjali, O., Ferret, O., Ghannay, S., & Le Borgne, H. In European Conference on Information Retrieval (ECIR), 2025.
@inproceedings{adjali2025entity_aware,
  title     = {Entity-Aware Cross-Modal Pretraining for Knowledge-based Visual Question Answering},
  author    = {Adjali, Omar and Ferret, Olivier and Ghannay, Sahar and Le Borgne, Herv{\'e}},
  booktitle = {European Conference on Information Retrieval (ECIR)},
  year      = {2025},
  url_HAL   = {https://hal-lara.archives-ouvertes.fr/SHARP/cea-04910767},
  abstract  = {Knowledge-Aware Visual Question Answering about Entities (KVQAE) is a recent multimodal task aiming to answer visual questions about named entities from a multimodal knowledge base. In this context, we focus more particularly on cross-modal retrieval and propose to inject information about entities into the representations of both texts and images as they are built, through two auxiliary pretraining tasks, namely entity-level masked language modeling and entity type prediction. We show competitive results over existing approaches on three standard KVQAE benchmarks, revealing the benefit of raising entity awareness during cross-modal pretraining, specifically for the KVQAE task.},
  keywords  = {kvqae}
}
