ViQuAE, a Dataset for Knowledge-Based Visual Question Answering about Named Entities. Lerner, P., Ferret, O., Guinaudeau, C., Le Borgne, H., Besançon, R., Moreno, J. G., & Lovón Melgarejo, J. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), pages 3108–3120, New York, NY, USA, 2022. Association for Computing Machinery.
Paper: https://doi.org/10.1145/3477495.3531753 · HAL: https://universite-paris-saclay.hal.science/hal-03650618/document · Code: https://github.com/PaulLerner/ViQuAE
Whether to retrieve, answer, translate, or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in answering questions about named entities grounded in a visual context using a Knowledge Base (KB). To benchmark this task, called KVQAE (Knowledge-based Visual Question Answering about named Entities), we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). The dataset is annotated using a semi-automatic method. We also propose a KB composed of 1.5M Wikipedia articles paired with images. To set a baseline on the benchmark, we address KVQAE as a two-stage problem: Information Retrieval and Reading Comprehension, with both zero- and few-shot learning methods. The experiments empirically demonstrate the difficulty of the task, especially when questions are not about persons. This work paves the way for better multimodal entity representations and question answering. The dataset, KB, code, and semi-automatic annotation pipeline are freely available at https://github.com/PaulLerner/ViQuAE.
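The abstract frames the baseline as a two-stage problem: retrieve supporting text from the KB, then extract the answer by reading comprehension. The snippet below is a minimal sketch of that framing, not the authors' pipeline (see the linked repository for that): it uses BM25 over a toy text-only KB and an off-the-shelf extractive QA model, and the passages, question, and model choice are illustrative placeholders.

# Minimal sketch of a two-stage KVQAE baseline (retrieval + reading comprehension).
# NOT the authors' code; toy KB and question are placeholders.
from rank_bm25 import BM25Okapi    # pip install rank-bm25
from transformers import pipeline  # pip install transformers

# Toy KB: in ViQuAE the actual KB holds ~1.5M Wikipedia articles paired with images.
kb_passages = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, completed in 1889.",
    "The Statue of Liberty was dedicated in New York Harbor in 1886.",
]

# Stage 1: Information Retrieval (plain BM25 over text here; the paper also studies
# visual retrieval, since the entity is shown in an image rather than named).
question = "When was the monument shown in the picture completed?"
bm25 = BM25Okapi([p.lower().split() for p in kb_passages])
scores = bm25.get_scores(question.lower().split())
best_passage = kb_passages[int(scores.argmax())]

# Stage 2: Reading Comprehension (extractive QA over the retrieved passage).
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
print(reader(question=question, context=best_passage))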
@inproceedings{lerner2022viquae,
  author    = {Lerner, Paul and Ferret, Olivier and Guinaudeau, Camille and {Le Borgne}, Herv\'{e} and Besan\c{c}on, Romaric and Moreno, Jose G. and Lov\'{o}n Melgarejo, Jes\'{u}s},
  title     = {ViQuAE, a Dataset for Knowledge-Based Visual Question Answering about Named Entities},
  year      = {2022},
  isbn      = {9781450387323},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3477495.3531753},
  doi       = {10.1145/3477495.3531753},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages     = {3108–3120},
  numpages  = {13},
  location  = {Madrid, Spain},
  series    = {SIGIR '22},
  url_HAL   = {https://universite-paris-saclay.hal.science/hal-03650618/document},
  url_Code  = {https://github.com/PaulLerner/ViQuAE},
  abstract  = {Whether to retrieve, answer, translate, or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in answering questions about named entities grounded in a visual context using a Knowledge Base (KB). To benchmark this task, called KVQAE (Knowledge-based Visual Question Answering about named Entities), we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). The dataset is annotated using a semi-automatic method. We also propose a KB composed of 1.5M Wikipedia articles paired with images. To set a baseline on the benchmark, we address KVQAE as a two-stage problem: Information Retrieval and Reading Comprehension, with both zero- and few-shot learning methods. The experiments empirically demonstrate the difficulty of the task, especially when questions are not about persons. This work paves the way for better multimodal entity representations and question answering. The dataset, KB, code, and semi-automatic annotation pipeline are freely available at https://github.com/PaulLerner/ViQuAE.},
  keywords  = {conf-int,kvqae}
}
