Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning. Li, Z., Chang, X., Yao, L., Pan, S., Ge, Z., & Zhang, H. In Tang, J. & Prakash, B. A., editors, Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 297–305, August 2020. Association for Computing Machinery (ACM).
The flourishing of social media platforms calls for techniques that can understand media content at scale. However, state-of-the-art video event understanding approaches remain limited in their ability to deal with data sparsity, semantically unrepresentative event names, and the lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions rather than from their event names; (2) aggregating the ranks of grounded concepts for MED tasks, where a statistical mean-shift outlier rejection model is proposed to remove outlying concepts that are incorrectly grounded; and (3) defining the MEC task and augmenting the MEC training set with the videos detected by MED in a zero-shot setting. To the best of our knowledge, this work is the first to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.
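The abstract gives only a high-level sketch of step (2); the paper itself has the details. As a minimal illustration of the idea, the following Python sketch iteratively re-centres on the mean of the grounded-concept scores and discards concepts that fall too far from it, until the mean stops shifting. The function name, the k-sigma rejection criterion, and the toy scores are assumptions for illustration, not the authors' implementation.

import numpy as np

def reject_outlying_concepts(scores, k=2.0, tol=1e-6, max_iter=100):
    # Hypothetical sketch of a statistical mean-shift outlier rejection:
    # recompute the mean of the retained scores after each pass and drop
    # concepts more than k standard deviations away, until the mean
    # stabilises. Not the authors' implementation.
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(len(scores), dtype=bool)
    prev_mean = np.inf
    for _ in range(max_iter):
        mean, std = scores[keep].mean(), scores[keep].std()
        if abs(mean - prev_mean) < tol or std == 0.0:
            break  # the mean has stopped shifting
        keep &= np.abs(scores - mean) <= k * std  # reject far-off concepts
        prev_mean = mean
    return keep  # boolean mask of concepts kept for rank aggregation

# Toy grounded-concept scores for one event description: the clearly
# outlying 0.12 concept is removed before the ranks are aggregated.
mask = reject_outlying_concepts([0.81, 0.78, 0.75, 0.79, 0.77, 0.12])
print(mask)  # -> [ True  True  True  True  True False]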
@inproceedings{e59e737b5c8a44daa3d1ab8ea004104b,
 title = {Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning},
 type = {inproceedings},
 year = {2020},
 keywords = {Grounding Visual Concepts,Multimedia Event Captioning,Multimedia Event Detection,Zero-shot Learning},
 pages = {297--305},
 month = {August},
 publisher = {Association for Computing Machinery (ACM)},
 city = {United States of America},
 id = {1cf5bb14-7c7b-36e1-8fc3-bf0e6c8eb8f1},
 created = {2021-06-15T12:45:27.016Z},
 file_attached = {false},
 profile_id = {079852a8-52df-3ac8-a41c-8bebd97d6b2b},
 last_modified = {2021-07-19T00:58:37.666Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {true},
 hidden = {false},
 citation_key = {e59e737b5c8a44daa3d1ab8ea004104b},
 source_type = {inproceedings},
 notes = {ACM International Conference on Knowledge Discovery and Data Mining 2020, KDD'20; Conference date: 23-08-2020 through 27-08-2020},
 folder_uuids = {f3b8cf54-f818-49eb-a899-33ac83c5e58d},
 private_publication = {false},
 abstract = {The flourishing of social media platforms calls for techniques that can understand media content at scale. However, state-of-the-art video event understanding approaches remain limited in their ability to deal with data sparsity, semantically unrepresentative event names, and the lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions rather than from their event names; (2) aggregating the ranks of grounded concepts for MED tasks, where a statistical mean-shift outlier rejection model is proposed to remove outlying concepts that are incorrectly grounded; and (3) defining the MEC task and augmenting the MEC training set with the videos detected by MED in a zero-shot setting. To the best of our knowledge, this work is the first to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.},
 bibtype = {inproceedings},
 author = {Li, Zhihui and Chang, Xiaojun and Yao, Lina and Pan, Shirui and Ge, Zongyuan and Zhang, Huaxiang},
 editor = {Tang, Jiliang and Prakash, B. Aditya},
 doi = {10.1145/3394486.3403072},
 booktitle = {Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining (KDD)}
}
