Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition. Wu, Y., Daoudi, M., & Amad, A. IEEE Transactions on Affective Computing, 15(1):157–172, IEEE, 2024.
Recently, wearable emotion recognition based on peripheral physiological signals has drawn massive attention due to its less invasive nature and its applicability in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully-supervised approaches suffer from overfitting given limited labeled data. To address the above issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically assigned labels by five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset, and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results in various emotion classification tasks. Meanwhile, the proposed model proved more accurate and robust than fully-supervised methods in low-data regimes.
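
The abstract describes the architecture only at a high level. The following is a minimal, hypothetical PyTorch sketch of that pipeline for orientation, not the authors' implementation: the class names, layer sizes, number of transformer layers, the six-way output (original signal plus five transforms), and the per-modality pretext heads are all illustrative assumptions.

import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    # Temporal-convolution encoder for a single physiological signal (assumed depth/width).
    def __init__(self, in_channels, d_model=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
        )

    def forward(self, x):                  # x: (batch, channels, time)
        return self.net(x)                 # -> (batch, d_model, time)


class SSLPretextModel(nn.Module):
    # Modality-specific encoders, a shared transformer encoder for cross-modal fusion,
    # and one transform-recognition head per modality (the pretext task).
    def __init__(self, modality_channels, d_model=64, n_classes=6):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(c, d_model) for c in modality_channels)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(d_model, n_classes) for _ in modality_channels)

    def forward(self, signals):            # list of (batch, channels, time) tensors
        tokens = [enc(x).transpose(1, 2) for enc, x in zip(self.encoders, signals)]
        fused = self.shared(torch.cat(tokens, dim=1))   # joint attention across all modalities
        logits, start = [], 0
        for tok, head in zip(tokens, self.heads):
            stop = start + tok.size(1)
            logits.append(head(fused[:, start:stop].mean(dim=1)))  # pool over time, classify transform
            start = stop
        return logits


# Toy usage: two single-channel modalities (e.g. EDA and BVP), 10 s at 64 Hz.
model = SSLPretextModel(modality_channels=[1, 1])
out = model([torch.randn(8, 1, 640), torch.randn(8, 1, 640)])

After pre-training on the transform-recognition task, the encoders and shared transformer would be frozen or fine-tuned with an emotion-classification head on the downstream datasets, as the abstract describes.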
@article{Wu2024,
abstract = {Recently, wearable emotion recognition based on peripheral physiological signals has drawn massive attention due to its less invasive nature and its applicability in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully-supervised based approaches suffer from overfitting given limited labeled data. To address the above issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically assigned labels by five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results in various emotion classification tasks. Meanwhile, the proposed model was proved to be more accurate and robust compared to fully-supervised methods on low data regimes.},
archivePrefix = {arXiv},
arxivId = {2303.17611},
author = {Wu, Yujin and Daoudi, Mohamed and Amad, Ali},
doi = {10.1109/TAFFC.2023.3263907},
eprint = {2303.17611},
issn = {1949-3045},
journal = {IEEE Transactions on Affective Computing},
keywords = {Emotion recognition,multimodal fusion,physiological signals,self-supervised learning,transformers},
number = {1},
pages = {157--172},
publisher = {IEEE},
title = {{Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition}},
volume = {15},
year = {2024}
}
