MMI-FER: Effective Facial Expression Recognition Through Multimodal Imaging for Traumatic Brain Injured Patient’s Rehabilitation

Published in Computer Vision, Imaging and Computer Graphics Theory and Applications — Communications in Computer and Information Science (CCIS), Vol. 997, Springer, 2019

Abstract

Accurately recognising the facial expressions of Traumatic Brain Injured (TBI) patients is essential for enabling affect-aware therapeutic robot interaction. However, the impaired, atypical, and highly variable facial expressions of TBI patients make this a substantially harder problem than standard FER on healthy subjects — particularly when relying on RGB imagery alone.

This paper extends earlier work by proposing a multimodal imaging pipeline that combines RGB, depth, and thermal infrared modalities to achieve more robust FER for TBI patients in rehabilitation settings. The multi-channel approach compensates for the limitations of each individual modality: RGB captures appearance, depth provides 3D geometry that is robust to head pose, and thermal captures physiological signals invisible in colour images. A fusion architecture integrates features across modalities to produce more reliable affective state estimates under the challenging conditions of clinical rehabilitation.

Experiments conducted on the TBI database, collected at a neurological rehabilitation centre, demonstrate that multimodal imaging outperforms unimodal RGB-only approaches and contributes to a more reliable pipeline for robot-assisted therapy.

Key Contributions

  • First multimodal imaging approach (RGB + depth + thermal) for FER with TBI patients
  • Demonstrates consistent gains of multimodal fusion over RGB-only baselines in clinical settings
  • Provides a systematic comparison of modality contributions for impaired facial expression recognition
  • Establishes a stronger foundation for affect-aware robotic rehabilitation systems targeting cognitively impaired users

Methodology

  1. Multimodal Data Acquisition — synchronised RGB, depth, and thermal streams from rehabilitation sessions
  2. Modality-Specific Feature Extraction — dedicated CNNs per channel exploiting each modality’s strengths
  3. Cross-Modal Fusion — late fusion combining modality-level predictions with learned weights (see the sketch after this list)
  4. Evaluation — systematic ablation across modality subsets to quantify individual and combined contributions
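To make the fusion step concrete, below is a minimal PyTorch sketch of learned-weight late fusion. The class name, the seven expression classes used in the example, and the softmax normalisation of the weights are illustrative assumptions; the paper itself only specifies that modality-level predictions are combined with learned weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusion(nn.Module):
    """Weighted late fusion of per-modality expression predictions.

    Hypothetical sketch: the softmax-normalised scalar weights and the
    class/variable names are assumptions, not the authors' exact design.
    """

    def __init__(self, num_modalities: int = 3):
        super().__init__()
        # One learnable scalar weight per modality (e.g. RGB, depth, thermal).
        self.weights = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, logits_per_modality):
        # logits_per_modality: list of (batch, num_classes) tensors,
        # one from each modality-specific CNN.
        stacked = torch.stack(logits_per_modality, dim=0)   # (M, B, C)
        # Normalise the weights so they sum to one before mixing.
        w = F.softmax(self.weights, dim=0).view(-1, 1, 1)    # (M, 1, 1)
        return (w * stacked).sum(dim=0)                      # (B, C)


if __name__ == "__main__":
    # Dummy logits standing in for the three modality-specific CNN outputs
    # (batch of 4 faces, 7 hypothetical expression classes).
    rgb, depth, thermal = (torch.randn(4, 7) for _ in range(3))
    fusion = LateFusion(num_modalities=3)
    fused = fusion([rgb, depth, thermal])
    print(fused.shape)  # torch.Size([4, 7])
```

In a full pipeline the per-modality logits would come from the dedicated RGB, depth, and thermal CNNs of step 2, and the fusion weights would be trained jointly with, or after, those networks.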

Publication Details

This is an extended and revised version of the VISAPP 2018 paper, published in the Communications in Computer and Information Science (CCIS) series by Springer as part of the VISIGRAPP post-conference proceedings.

Files