10 May 2022

A challenging task in audiovisual emotion recognition is to design neural network architectures that can leverage and fuse multimodal information while temporally aligning the modalities, handling missing modalities, and capturing information from all modalities without losing it during training. These requirements are important for achieving model robustness and increasing accuracy on the emotion recognition task. A recent approach to multimodal fusion is to use the transformer architecture to fuse and align the modalities. This study proposes the AuxFormer framework, which addresses the aforementioned challenges in a principled way. AuxFormer combines the transformer framework with auxiliary networks, using shared losses to infuse information from separately embedded single-modality networks. This extra layer of audiovisual information added to our main network retains information that would otherwise be lost during training. Results show that the AuxFormer architecture performs 6.8% to 7.2% better on the CREMA-D corpus and 2.3% to 3.5% better on the MSP-IMPROV corpus than state-of-the-art baselines, indicating that our framework benefits from the auxiliary networks. We also show that under non-ideal conditions (e.g., missing modalities), our architecture sustains strong performance in audio-only and video-only scenarios, benefiting from the optimized training strategy explored in this study.
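The abstract describes a main audiovisual fusion transformer trained jointly with auxiliary single-modality networks through shared losses. The sketch below illustrates one way such a setup could be wired in PyTorch; the feature dimensions, layer counts, pooling strategy, loss weights, and class count are illustrative assumptions, not the configuration reported in the paper.

import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    # Single-modality transformer branch (audio-only or video-only auxiliary network).
    def __init__(self, in_dim, d_model=128, n_heads=4, n_layers=2, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        h = self.encoder(self.proj(x))      # (batch, time, d_model)
        pooled = h.mean(dim=1)              # temporal average pooling (assumption)
        return self.head(pooled), h         # unimodal logits + sequence features


class AuxFormerSketch(nn.Module):
    # Main audiovisual fusion transformer plus two auxiliary unimodal networks.
    def __init__(self, audio_dim=40, video_dim=512, d_model=128, n_classes=4):
        super().__init__()
        self.audio_aux = ModalityEncoder(audio_dim, d_model, n_classes=n_classes)
        self.video_aux = ModalityEncoder(video_dim, d_model, n_classes=n_classes)
        fusion_layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        a_logits, a_feats = self.audio_aux(audio)
        v_logits, v_feats = self.video_aux(video)
        # Concatenate the unimodal token sequences and fuse them with a transformer,
        # letting self-attention relate audio and video frames across time.
        fused = self.fusion(torch.cat([a_feats, v_feats], dim=1))
        av_logits = self.classifier(fused.mean(dim=1))
        return av_logits, a_logits, v_logits


def shared_loss(av_logits, a_logits, v_logits, labels, w_audio=0.3, w_video=0.3):
    # Joint objective: main audiovisual loss plus weighted auxiliary unimodal losses,
    # so the single-modality branches keep injecting information during training.
    ce = nn.functional.cross_entropy
    return (ce(av_logits, labels)
            + w_audio * ce(a_logits, labels)
            + w_video * ce(v_logits, labels))


# Example forward/backward pass with random features (shapes are assumptions).
model = AuxFormerSketch()
audio = torch.randn(8, 100, 40)    # e.g., 100 acoustic frames of 40-dim features
video = torch.randn(8, 30, 512)    # e.g., 30 visual frames of 512-dim features
labels = torch.randint(0, 4, (8,))
av_logits, a_logits, v_logits = model(audio, video)
shared_loss(av_logits, a_logits, v_logits, labels).backward()

Because the auxiliary branches are trained with the same labels as the fusion network, the unimodal encoders remain usable on their own, which is consistent with the audio-only and video-only robustness the abstract reports.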
