Deep Multimodal Data Fusion

Deep Multimodal Data Fusion. Zhao, F., Zhang, C., & Geng, B. ACM Computing Surveys, 56(9):1–36, October 2024.

Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become increasingly sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into a single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on only one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.

@article{zhao_deep_2024,
	title = {Deep {Multimodal} {Data} {Fusion}},
	volume = {56},
	issn = {0360-0300, 1557-7341},
	url = {https://dl.acm.org/doi/10.1145/3649447},
	doi = {10.1145/3649447},
	abstract = {Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become increasingly sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into a single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on only one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.},
	language = {en},
	number = {9},
	urldate = {2025-03-09},
	journal = {ACM Computing Surveys},
	author = {Zhao, Fei and Zhang, Chengcui and Geng, Baocheng},
	month = oct,
	year = {2024},
	pages = {1--36},
}
