BodyEmotion: Deep Emotion Recognition through Upper Body Movements and Facial Expression
Published in 16th International Conference on Computer Vision Theory and Applications (VISAPP 2021), SciTePress, 2021

Abstract
Human emotion recognition in Human-Robot Interaction (HRI) contexts must contend with significant variability in how emotion is expressed — through facial configuration, body posture, and gesture. While facial expression has long been the dominant modality, upper body movements provide complementary and often more reliable affective cues, particularly for users with restricted facial expressivity.
This paper presents a deep convolutional neural network that learns a correspondence between upper body movements and facial expressions, so that emotion and gesture can be recognised from either modality on its own. The model is trained on benchmark datasets covering diverse emotion categories and their corresponding body movement patterns. Once the mapping is learned, the system can infer body-level emotional context from facial features alone, which is of practical value where full-body capture is unavailable.
Experiments on standard benchmark datasets demonstrate that the joint body-face model achieves competitive emotion recognition performance while providing additional gesture-level inference capability not available in face-only systems.
Key Contributions
- Cross-modal deep architecture learning correspondences between facial expressions and upper body movement patterns
- Emotion recognition from facial features alone, generalising to body-level gesture states
- Evaluated on established benchmark datasets across multiple emotion categories
- Directly relevant to HRI scenarios where full-body tracking is unreliable or unavailable
Architecture
The model is structured as a two-stream deep CNN whose streams feed a shared embedding layer and a joint classifier:
- Facial Stream — spatial CNN processing normalised face regions
- Body Movement Stream — temporal CNN over optical flow or pose sequences capturing upper-body dynamics
- Cross-Modal Embedding Layer — shared representation space learning facial–body correspondences
- Emotion Classifier — joint softmax head over fused representations
At inference, only the facial stream is required, with the body stream providing supervision during training through the shared embedding.
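The split between training and inference described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain linear projections stand in for the two CNN streams, and the feature sizes, the averaging fusion, the mean-squared alignment loss, and all names (`W_face`, `W_body`, `W_cls`, `training_forward`, `inference_forward`) are assumptions made only to show the data flow.

```python
import math
import random

random.seed(0)

# Hypothetical sizes; the paper summary does not state any dimensions.
FACE_DIM, BODY_DIM, EMBED_DIM, NUM_EMOTIONS = 8, 6, 4, 3


def make_matrix(rows, cols):
    """Random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def project(weights, vector):
    """Linear projection: one dot product per output row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Linear stand-ins for the facial stream, body stream, and classifier head.
W_face = make_matrix(EMBED_DIM, FACE_DIM)
W_body = make_matrix(EMBED_DIM, BODY_DIM)
W_cls = make_matrix(NUM_EMOTIONS, EMBED_DIM)


def alignment_loss(face_embed, body_embed):
    """Mean squared distance between the two embeddings (assumed loss)."""
    return sum((f - b) ** 2 for f, b in zip(face_embed, body_embed)) / len(face_embed)


def training_forward(face_feat, body_feat):
    """Training path: both streams feed the shared embedding space."""
    face_embed = project(W_face, face_feat)
    body_embed = project(W_body, body_feat)
    # Assumed fusion: element-wise average of the two embeddings.
    fused = [(f + b) / 2 for f, b in zip(face_embed, body_embed)]
    probs = softmax(project(W_cls, fused))
    return probs, alignment_loss(face_embed, body_embed)


def inference_forward(face_feat):
    """Inference path: only the facial stream is evaluated."""
    return softmax(project(W_cls, project(W_face, face_feat)))


face = [random.random() for _ in range(FACE_DIM)]
body = [random.random() for _ in range(BODY_DIM)]
train_probs, align = training_forward(face, body)
infer_probs = inference_forward(face)
```

In this sketch the alignment term is what would pull the facial embedding toward the body embedding during training, which is why `inference_forward` can reuse the same classifier head without any body input. A real system would replace the linear projections with the spatial and temporal CNNs listed above and train all weights jointly.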
Venue
Presented at the 16th International Conference on Computer Vision Theory and Applications (VISAPP 2021), part of VISIGRAPP, held online, February 8–10, 2021.
