Interpretable Visual Question Answering by Visual Grounding from Automatic Attention Annotations. Zhang, B., Niebles, J., & Soto, A. In WACV, 2019.

Abstract: A key aspect of VQA models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve higher correlation to manually-annotated groundings than alternative approaches, even in the case of state-of-the-art algorithms that are directly trained with human grounding annotations.
@InProceedings{ben:etal:2018,
  author    = {B. Zhang and JC. Niebles and A. Soto},
  title     = {Interpretable Visual Question Answering by Visual
               Grounding from Automatic Attention Annotations},
  booktitle = {WACV},
  year      = {2019},
  abstract  = {A key aspect of VQA models that are interpretable is their
               ability to ground their answers to relevant regions in the
               image. Current approaches with this capability rely on
               supervised learning and human annotated groundings to train
               attention mechanisms inside the VQA architecture.
               Unfortunately, obtaining human annotations specific for
               visual grounding is difficult and expensive. In this work,
               we demonstrate that we can effectively train a VQA
               architecture with grounding supervision that can be
               automatically obtained from available region descriptions
               and object annotations. We also show that our model trained
               with this mined supervision generates visual groundings
               that achieve higher correlation to manually-annotated
               groundings than alternative approaches, even in the case of
               state-of-the-art algorithms that are directly trained with
               human grounding annotations.},
  url       = {https://arxiv.org/abs/1808.00265}
}
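As a rough illustration of the kind of grounding supervision the abstract describes, the sketch below shows one plausible way to combine a standard VQA answer loss with an attention-alignment term that pushes the model's region attention toward an automatically mined grounding distribution. This is a hypothetical PyTorch sketch, not the authors' implementation: the function name vqa_grounding_loss, the tensor shapes, the KL formulation, and the mixing weight alpha are all illustrative assumptions.

import torch
import torch.nn.functional as F

def vqa_grounding_loss(answer_logits, answer_target, attn, mined_attn, alpha=0.5):
    """Hypothetical combined loss for attention-supervised VQA (illustrative only).

    answer_logits: (B, num_answers) predicted answer scores
    answer_target: (B,) ground-truth answer indices
    attn:          (B, R) model attention over R image regions (rows sum to 1)
    mined_attn:    (B, R) automatically mined grounding distribution (rows sum to 1)
    alpha:         weight trading off answer accuracy vs. attention alignment
    """
    # Standard answer-classification loss.
    ans_loss = F.cross_entropy(answer_logits, answer_target)
    # KL(mined_attn || attn): penalize attention that diverges from the
    # mined grounding; epsilon keeps the logarithms finite.
    eps = 1e-8
    attn_loss = torch.sum(
        mined_attn * (torch.log(mined_attn + eps) - torch.log(attn + eps)),
        dim=1).mean()
    return ans_loss + alpha * attn_loss

The cross-entropy term trains the answer head as usual, while the KL term supervises the attention map with the mined groundings; the paper's actual loss and its procedure for mining supervision from region descriptions and object annotations may differ.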
{"_id":"t6mX5bNeDyoRy6ZcQ","bibbaseid":"zhang-niebles-soto-interpretablevisualquestionansweringbyvisualgroundingfromautomaticattentionannotations-2019","authorIDs":["32ZR23o2BFySHbtQK","3ear6KFZSRqbj6YeT","4Pq6KLaQ8jKGXHZWH","54578d9a2abc8e9f370004f0","5e126ca5a4cabfdf01000053","5e158f76f1f31adf01000118","5e16174bf67f7dde010003ad","5e1f631ae8f5ddde010000eb","5e1f7182e8f5ddde010001ff","5e26da3642065ede01000066","5e3acefaf2a00cdf010001c8","5e62c3aecb259cde010000f9","5e65830c6e5f4cf3010000e7","5e666dfc46e828de010002c9","6cMBYieMJhf6Nd58M","6w6sGsxYSK2Quk6yZ","7xDcntrrtC62vkWM5","ARw5ReidxxZii9TTZ","DQ4JRTTWkvKXtCNCp","GbYBJvxugXMriQwbi","HhRoRmBvwWfD4oLyK","JFk6x26H6LZMoht2n","JvArGGu5qM6EvSCvB","LpqQBhFH3PxepH9KY","MT4TkSGzAp69M3dGt","QFECgvB5v2i4j2Qzs","RKv56Kes3h6FwEa55","Rb9TkQ3KkhGAaNyXq","RdND8NxcJDsyZdkcK","SpKJ5YujbHKZnHc4v","TSRdcx4bbYKqcGbDg","W8ogS2GJa6sQKy26c","WTi3X2fT8dzBN5d8b","WfZbctNQYDBaiYW6n","XZny8xuqwfoxzhBCB","Xk2Q5qedS5MFHvjEW","bbARiTJLYS79ZMFbk","cBxsyeZ37EucQeBYK","cFyFQps7W3Sa2Wope","dGRBfr8zhMmbwK6eP","eRLgwkrEk7T7Lmzmf","fMYSCX8RMZap548vv","g6iKCQCFnJgKYYHaP","h2hTcQYuf2PB3oF8t","h83jBvZYJPJGutQrs","jAtuJBcGhng4Lq2Nd","pMoo2gotJcdDPwfrw","q5Zunk5Y2ruhw5vyq","rzNGhqxkbt2MvGY29","uC8ATA8AfngWpYLBq","uoJ7BKv28Q6TtPmPp","vMiJzqEKCsBxBEa3v","vQE6iTPpjxpuLip2Z","wQDRsDjhgpMJDGxWX","wbNg79jvDpzX9zHLK","wk86BgRiooBjy323E","zCbPxKnQGgDHiHMWn","zf9HENjsAzdWLMDAu"],"author_short":["Zhang, B.","Niebles, J.","Soto, A."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","author":[{"firstnames":["B."],"propositions":[],"lastnames":["Zhang"],"suffixes":[]},{"firstnames":["JC."],"propositions":[],"lastnames":["Niebles"],"suffixes":[]},{"firstnames":["A."],"propositions":[],"lastnames":["Soto"],"suffixes":[]}],"title":"Interpretable Visual Question Answering by Visual Grounding from Automatic Attention Annotations","booktitle":"WACV","year":"2019","abstract":"A key aspect of VQA models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve higher correlation to manually-annotated groundings than alternative approaches, even in the case of state-of-the-art algorithms that are directly trained with human grounding annotations.","url":"https://arxiv.org/abs/1808.00265","bibtex":"@InProceedings{\t ben:etal:2018,\n author\t= {B. Zhang and JC. Niebles and A. Soto},\n title\t\t= {Interpretable Visual Question Answering by Visual\n\t\t Grounding from Automatic Attention Annotations},\n booktitle\t= {WACV},\n year\t\t= {2019},\n abstract\t= {A key aspect of VQA models that are interpretable is their\n\t\t ability to ground their answers to relevant regions in the\n\t\t image. Current approaches with this capability rely on\n\t\t supervised learning and human annotated groundings to train\n\t\t attention mechanisms inside the VQA architecture.\n\t\t Unfortunately, obtaining human annotations specific for\n\t\t visual grounding is difficult and expensive. 
In this work,\n\t\t we demonstrate that we can effectively train a VQA\n\t\t architecture with grounding supervision that can be\n\t\t automatically obtained from available region descriptions\n\t\t and object annotations. We also show that our model trained\n\t\t with this mined supervision generates visual groundings\n\t\t that achieve higher correlation to manually-annotated\n\t\t groundings than alternative approaches, even in the case of\n\t\t state-of-the-art algorithms that are directly trained with\n\t\t human grounding annotations.},\n url\t\t= {https://arxiv.org/abs/1808.00265}\n}\n\n","author_short":["Zhang, B.","Niebles, J.","Soto, A."],"key":"ben:etal:2018","id":"ben:etal:2018","bibbaseid":"zhang-niebles-soto-interpretablevisualquestionansweringbyvisualgroundingfromautomaticattentionannotations-2019","role":"author","urls":{"Paper":"https://arxiv.org/abs/1808.00265"},"metadata":{"authorlinks":{"soto, a":"https://asoto.ing.puc.cl/publications/"}},"downloads":6},"bibtype":"inproceedings","biburl":"https://raw.githubusercontent.com/ialab-puc/ialab.ing.puc.cl/master/pubs.bib","creationDate":"2019-05-20T21:47:00.409Z","downloads":6,"keywords":[],"search_terms":["interpretable","visual","question","answering","visual","grounding","automatic","attention","annotations","zhang","niebles","soto"],"title":"Interpretable Visual Question Answering by Visual Grounding from Automatic Attention Annotations","year":2019,"dataSources":["3YPRCmmijLqF4qHXd","sg6yZ29Z2xB5xP79R","m8qFBfFbjk9qWjcmJ","QjT2DEZoWmQYxjHXS"]}