Interpretable Visual Question Answering by Visual Grounding from Automatic Attention Annotations. Zhang, B., Niebles, J., & Soto, A. In WACV, 2019.

Abstract: A key aspect of VQA models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve higher correlation to manually-annotated groundings than alternative approaches, even in the case of state-of-the-art algorithms that are directly trained with human grounding annotations.
@InProceedings{ben:etal:2018,
  author    = {B. Zhang and JC. Niebles and A. Soto},
  title     = {Interpretable Visual Question Answering by Visual
               Grounding from Automatic Attention Annotations},
  booktitle = {WACV},
  year      = {2019},
  abstract  = {A key aspect of VQA models that are interpretable is their
               ability to ground their answers to relevant regions in the
               image. Current approaches with this capability rely on
               supervised learning and human annotated groundings to train
               attention mechanisms inside the VQA architecture.
               Unfortunately, obtaining human annotations specific for
               visual grounding is difficult and expensive. In this work,
               we demonstrate that we can effectively train a VQA
               architecture with grounding supervision that can be
               automatically obtained from available region descriptions
               and object annotations. We also show that our model trained
               with this mined supervision generates visual groundings
               that achieve higher correlation to manually-annotated
               groundings than alternative approaches, even in the case of
               state-of-the-art algorithms that are directly trained with
               human grounding annotations.},
  url       = {https://arxiv.org/abs/1808.00265}
}
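As a rough illustration of the kind of grounding supervision the abstract describes, the sketch below shows one plausible way to combine a standard VQA answer loss with an attention-alignment term that pushes the model's region attention toward an automatically mined grounding distribution. This is a hypothetical PyTorch sketch, not the authors' implementation: the function name vqa_grounding_loss, the tensor shapes, the KL formulation, and the mixing weight alpha are all illustrative assumptions.

import torch
import torch.nn.functional as F

def vqa_grounding_loss(answer_logits, answer_target, attn, mined_attn, alpha=0.5):
    """Hypothetical combined loss for attention-supervised VQA (illustrative only).

    answer_logits: (B, num_answers) predicted answer scores
    answer_target: (B,) ground-truth answer indices
    attn:          (B, R) model attention over R image regions (rows sum to 1)
    mined_attn:    (B, R) automatically mined grounding distribution (rows sum to 1)
    alpha:         weight trading off answer accuracy vs. attention alignment
    """
    # Standard answer-classification loss.
    ans_loss = F.cross_entropy(answer_logits, answer_target)
    # KL(mined_attn || attn): penalize attention that diverges from the
    # mined grounding; epsilon keeps the logarithms finite.
    eps = 1e-8
    attn_loss = torch.sum(
        mined_attn * (torch.log(mined_attn + eps) - torch.log(attn + eps)),
        dim=1).mean()
    return ans_loss + alpha * attn_loss

The cross-entropy term trains the answer head as usual, while the KL term supervises the attention map with the mined groundings; the paper's actual loss and its procedure for mining supervision from region descriptions and object annotations may differ.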
{"_id":"t6mX5bNeDyoRy6ZcQ","bibbaseid":"zhang-niebles-soto-interpretablevisualquestionansweringbyvisualgroundingfromautomaticattentionannotations-2019","authorIDs":["32ZR23o2BFySHbtQK","3ear6KFZSRqbj6YeT","4Pq6KLaQ8jKGXHZWH","54578d9a2abc8e9f370004f0","5e126ca5a4cabfdf01000053","5e158f76f1f31adf01000118","5e16174bf67f7dde010003ad","5e1f631ae8f5ddde010000eb","5e1f7182e8f5ddde010001ff","5e26da3642065ede01000066","5e3acefaf2a00cdf010001c8","5e62c3aecb259cde010000f9","5e65830c6e5f4cf3010000e7","5e666dfc46e828de010002c9","6cMBYieMJhf6Nd58M","6w6sGsxYSK2Quk6yZ","7xDcntrrtC62vkWM5","ARw5ReidxxZii9TTZ","DQ4JRTTWkvKXtCNCp","GbYBJvxugXMriQwbi","HhRoRmBvwWfD4oLyK","JFk6x26H6LZMoht2n","JvArGGu5qM6EvSCvB","LpqQBhFH3PxepH9KY","MT4TkSGzAp69M3dGt","QFECgvB5v2i4j2Qzs","RKv56Kes3h6FwEa55","Rb9TkQ3KkhGAaNyXq","RdND8NxcJDsyZdkcK","SpKJ5YujbHKZnHc4v","TSRdcx4bbYKqcGbDg","W8ogS2GJa6sQKy26c","WTi3X2fT8dzBN5d8b","WfZbctNQYDBaiYW6n","XZny8xuqwfoxzhBCB","Xk2Q5qedS5MFHvjEW","bbARiTJLYS79ZMFbk","cBxsyeZ37EucQeBYK","cFyFQps7W3Sa2Wope","dGRBfr8zhMmbwK6eP","eRLgwkrEk7T7Lmzmf","fMYSCX8RMZap548vv","g6iKCQCFnJgKYYHaP","h2hTcQYuf2PB3oF8t","h83jBvZYJPJGutQrs","jAtuJBcGhng4Lq2Nd","pMoo2gotJcdDPwfrw","q5Zunk5Y2ruhw5vyq","rzNGhqxkbt2MvGY29","uC8ATA8AfngWpYLBq","uoJ7BKv28Q6TtPmPp","vMiJzqEKCsBxBEa3v","vQE6iTPpjxpuLip2Z","wQDRsDjhgpMJDGxWX","wbNg79jvDpzX9zHLK","wk86BgRiooBjy323E","zCbPxKnQGgDHiHMWn","zf9HENjsAzdWLMDAu"],"author_short":["Zhang, B.","Niebles, J.","Soto, A."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","author":[{"firstnames":["B."],"propositions":[],"lastnames":["Zhang"],"suffixes":[]},{"firstnames":["JC."],"propositions":[],"lastnames":["Niebles"],"suffixes":[]},{"firstnames":["A."],"propositions":[],"lastnames":["Soto"],"suffixes":[]}],"title":"Interpretable Visual Question Answering by Visual Grounding from Automatic Attention Annotations","booktitle":"WACV","year":"2019","abstract":"A key aspect of VQA models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve higher correlation to manually-annotated groundings than alternative approaches, even in the case of state-of-the-art algorithms that are directly trained with human grounding annotations.","url":"https://arxiv.org/abs/1808.00265","bibtex":"@InProceedings{\t ben:etal:2018,\n author\t= {B. Zhang and JC. Niebles and A. Soto},\n title\t\t= {Interpretable Visual Question Answering by Visual\n\t\t Grounding from Automatic Attention Annotations},\n booktitle\t= {WACV},\n year\t\t= {2019},\n abstract\t= {A key aspect of VQA models that are interpretable is their\n\t\t ability to ground their answers to relevant regions in the\n\t\t image. Current approaches with this capability rely on\n\t\t supervised learning and human annotated groundings to train\n\t\t attention mechanisms inside the VQA architecture.\n\t\t Unfortunately, obtaining human annotations specific for\n\t\t visual grounding is difficult and expensive. 
In this work,\n\t\t we demonstrate that we can effectively train a VQA\n\t\t architecture with grounding supervision that can be\n\t\t automatically obtained from available region descriptions\n\t\t and object annotations. We also show that our model trained\n\t\t with this mined supervision generates visual groundings\n\t\t that achieve higher correlation to manually-annotated\n\t\t groundings than alternative approaches, even in the case of\n\t\t state-of-the-art algorithms that are directly trained with\n\t\t human grounding annotations.},\n url\t\t= {https://arxiv.org/abs/1808.00265}\n}\n\n","author_short":["Zhang, B.","Niebles, J.","Soto, A."],"key":"ben:etal:2018","id":"ben:etal:2018","bibbaseid":"zhang-niebles-soto-interpretablevisualquestionansweringbyvisualgroundingfromautomaticattentionannotations-2019","role":"author","urls":{"Paper":"https://arxiv.org/abs/1808.00265"},"metadata":{"authorlinks":{"soto, a":"https://asoto.ing.puc.cl/publications/"}},"downloads":6},"bibtype":"inproceedings","biburl":"https://raw.githubusercontent.com/ialab-puc/ialab.ing.puc.cl/master/pubs.bib","creationDate":"2019-05-20T21:47:00.409Z","downloads":6,"keywords":[],"search_terms":["interpretable","visual","question","answering","visual","grounding","automatic","attention","annotations","zhang","niebles","soto"],"title":"Interpretable Visual Question Answering by Visual Grounding from Automatic Attention Annotations","year":2019,"dataSources":["3YPRCmmijLqF4qHXd","sg6yZ29Z2xB5xP79R","m8qFBfFbjk9qWjcmJ","QjT2DEZoWmQYxjHXS"]}