TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification. Villa, A., Perez-Rua, J., Araujo, V., Niebles, J., Escorcia, V., & Soto, A. In BMVC, 2021.
Recently, few-shot learning has received increasing interest. Existing efforts have focused on image classification, with very few attempts dedicated to the more challenging few-shot video classification problem. These few attempts aim to effectively exploit the temporal dimension of videos for better learning in low-data regimes. However, they have largely ignored a key characteristic of videos that could be vital for few-shot recognition: videos are often accompanied by rich text descriptions. In this paper, for the first time, we propose to leverage these human-provided textual descriptions as privileged information when training a few-shot video classification model. Specifically, we formulate a text-based task conditioner to adapt video features to the few-shot learning task. Our model follows a transductive setting where query samples and support textual descriptions can be used to update the support-set class prototypes, further improving the task-adaptation ability of the model. Our model obtains state-of-the-art performance on four challenging benchmarks for few-shot video action classification.
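The abstract does not spell out how the text-based conditioning or the transductive prototype update is implemented. The sketch below is a minimal illustration of the two ideas it describes, assuming FiLM-style feature modulation (Perez et al., 2018) for the text conditioner and a single soft-assignment refinement step for the prototypes; all module names, shapes, and the exact update rule are illustrative assumptions, not the paper's published method.

```python
# Illustrative sketch only (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditioner(nn.Module):
    """Maps a task-level text embedding to per-channel scale/shift (FiLM-style)."""
    def __init__(self, text_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_dim)
        self.to_beta = nn.Linear(text_dim, feat_dim)

    def forward(self, video_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: (N, D) pooled clip features; text_emb: (D_text,) description embedding
        gamma = self.to_gamma(text_emb)           # (D,) per-channel scale
        beta = self.to_beta(text_emb)             # (D,) per-channel shift
        return video_feats * (1 + gamma) + beta   # broadcast over the N clips

def transductive_prototypes(support: torch.Tensor, support_labels: torch.Tensor,
                            query: torch.Tensor, n_way: int, alpha: float = 0.5):
    """Refine class prototypes using soft query assignments (one EM-like step)."""
    # support: (S, D), query: (Q, D); features assumed L2-normalized
    protos = torch.stack([support[support_labels == c].mean(0) for c in range(n_way)])
    logits = query @ protos.t()                   # (Q, n_way) cosine similarities
    weights = logits.softmax(dim=1)               # soft assignment of queries to classes
    query_means = (weights.t() @ query) / weights.sum(0).unsqueeze(1).clamp(min=1e-6)
    # Blend the support-only prototypes with the query-derived class means
    return F.normalize((1 - alpha) * protos + alpha * query_means, dim=1)
```

In this reading, the conditioner adapts all episode features to the task described by the text, and the query set then nudges the prototypes transductively before nearest-prototype classification; the blend weight `alpha` is a hypothetical knob introduced here for the sketch.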
