Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition. Wu, Y., Daoudi, M., & Amad, A. IEEE Transactions on Affective Computing, 15(1):157–172, IEEE, 2024.
Recently, wearable emotion recognition based on peripheral physiological signals has drawn massive attention due to its less invasive nature and its applicability in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully-supervised approaches suffer from overfitting given limited labeled data. To address the above issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically assigned labels by five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset, and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results in various emotion classification tasks. Meanwhile, the proposed model proved more accurate and robust than fully-supervised methods in low-data regimes.
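
The abstract describes the architecture only at a high level. The following is a minimal, hypothetical PyTorch sketch of that pipeline for orientation, not the authors' implementation: the class names, layer sizes, number of transformer layers, the six-way output (original signal plus five transforms), and the per-modality pretext heads are all illustrative assumptions.

import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    # Temporal-convolution encoder for a single physiological signal (assumed depth/width).
    def __init__(self, in_channels, d_model=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
        )

    def forward(self, x):                  # x: (batch, channels, time)
        return self.net(x)                 # -> (batch, d_model, time)


class SSLPretextModel(nn.Module):
    # Modality-specific encoders, a shared transformer encoder for cross-modal fusion,
    # and one transform-recognition head per modality (the pretext task).
    def __init__(self, modality_channels, d_model=64, n_classes=6):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(c, d_model) for c in modality_channels)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(d_model, n_classes) for _ in modality_channels)

    def forward(self, signals):            # list of (batch, channels, time) tensors
        tokens = [enc(x).transpose(1, 2) for enc, x in zip(self.encoders, signals)]
        fused = self.shared(torch.cat(tokens, dim=1))   # joint attention across all modalities
        logits, start = [], 0
        for tok, head in zip(tokens, self.heads):
            stop = start + tok.size(1)
            logits.append(head(fused[:, start:stop].mean(dim=1)))  # pool over time, classify transform
            start = stop
        return logits


# Toy usage: two single-channel modalities (e.g. EDA and BVP), 10 s at 64 Hz.
model = SSLPretextModel(modality_channels=[1, 1])
out = model([torch.randn(8, 1, 640), torch.randn(8, 1, 640)])

After pre-training on the transform-recognition task, the encoders and shared transformer would be frozen or fine-tuned with an emotion-classification head on the downstream datasets, as the abstract describes.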
@article{Wu2024,
abstract = {Recently, wearable emotion recognition based on peripheral physiological signals has drawn massive attention due to its less invasive nature and its applicability in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully-supervised based approaches suffer from overfitting given limited labeled data. To address the above issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically assigned labels by five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results in various emotion classification tasks. Meanwhile, the proposed model was proved to be more accurate and robust compared to fully-supervised methods on low data regimes.},
archivePrefix = {arXiv},
arxivId = {2303.17611},
author = {Wu, Yujin and Daoudi, Mohamed and Amad, Ali},
doi = {10.1109/TAFFC.2023.3263907},
eprint = {2303.17611},
issn = {1949-3045},
journal = {IEEE Transactions on Affective Computing},
keywords = {Emotion recognition,multimodal fusion,physiological signals,self-supervised learning,transformers},
number = {1},
pages = {157--172},
publisher = {IEEE},
title = {{Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition}},
volume = {15},
year = {2024}
}
