Co-Speech Gesture Detection through Multi-Phase Sequence Labeling. Ghaleb, E., Burenko, I., Rasenberg, M., Pouw, W., Uhrig, P., Holler, J., Toni, I., Özyürek, A., & Fernández, R. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.
@inproceedings{ghaleb-etal-2024-wacv,
  title = "Co-Speech Gesture Detection through Multi-Phase Sequence Labeling",
  author = "Esam Ghaleb and Ilya Burenko and Marlou Rasenberg and Wim Pouw and Peter Uhrig and Judith Holler and Ivan Toni and Asl{\i} {\"O}zy{\"u}rek and Raquel Fern{\'a}ndez",
booktitle = "Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)",
year = "2024",
  url = "../papers/GhalebEtAl-WAVC2024.pdf",
  url_supp = "../papers/GhalebEtAl-WAVC2024-supp.pdf",
  url_github = "https://github.com/EsamGhaleb/Multi-Phase-Gesture-Detection",
  abstract = "Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework's capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis."
}