  • SPS Members: Free
  • IEEE Members: $11.00
  • Non-members: $15.00
  • Length: 00:14:50
11 May 2022

The human process from perception to affective response is gated by a bottom-up saliency mechanism at the sensory level. Specifically, auditory saliency emphasizes the audio segments that must be attended to in order to cognitively appraise and experience emotion. In this work, inspired by this mechanism, we propose an end-to-end feature masking network for audio emotion recognition in movies. Our proposed Audio-Saliency Masking Transformer (ASMT) adjusts the feature embedding using two learnable masks: one cross-refers to an auditory saliency map, and the other is derived through self-reference. By jointly training the front-end mask gating and the back-end transformer emotion classifier, we improve the three-class unweighted average recall (UAR) by 1.74%, 1.27%, 0.95%, and 0.82% over the best of the other models on experienced arousal, experienced valence, intended arousal, and intended valence, respectively. We further analyze which acoustic feature categories our saliency mask attends to most.
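To make the dual-mask gating idea concrete, the sketch below shows one plausible realization: a feature embedding is gated element-wise by a mask conditioned on an auditory saliency map (cross-reference) and a mask conditioned on the embedding itself (self-reference), then fed to a transformer classifier. This is a minimal PyTorch sketch under our own assumptions; the module names (`DualMaskGate`, `MaskedTransformerClassifier`), sigmoid gating, dimensions, layer counts, and mean-pooling head are illustrative and not taken from the paper.

```python
# Illustrative sketch of saliency-conditioned mask gating before a
# transformer classifier. All names and wiring are assumptions; the
# abstract does not specify the exact architecture.
import torch
import torch.nn as nn


class DualMaskGate(nn.Module):
    """Gates an audio feature embedding with two learnable sigmoid masks:
    one conditioned on an auditory saliency map (cross-reference) and one
    conditioned on the embedding itself (self-reference)."""

    def __init__(self, feat_dim: int, saliency_dim: int):
        super().__init__()
        self.cross_mask = nn.Linear(saliency_dim, feat_dim)  # saliency -> mask
        self.self_mask = nn.Linear(feat_dim, feat_dim)       # embedding -> mask

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # feats:    (batch, time, feat_dim) audio feature embedding
        # saliency: (batch, time, saliency_dim) auditory saliency map
        m_cross = torch.sigmoid(self.cross_mask(saliency))
        m_self = torch.sigmoid(self.self_mask(feats))
        return feats * m_cross * m_self  # element-wise gating


class MaskedTransformerClassifier(nn.Module):
    """Front-end mask gating followed by a transformer encoder and a
    three-class emotion head, trained jointly end to end."""

    def __init__(self, feat_dim: int = 128, saliency_dim: int = 1,
                 n_classes: int = 3):
        super().__init__()
        self.gate = DualMaskGate(feat_dim, saliency_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        x = self.gate(feats, saliency)   # apply both masks to the embedding
        x = self.encoder(x)              # contextualize over time
        return self.head(x.mean(dim=1))  # pool over time, then classify


# Usage: a batch of 4 clips, 100 frames, 128-dim features, scalar saliency.
model = MaskedTransformerClassifier()
logits = model(torch.randn(4, 100, 128), torch.rand(4, 100, 1))
print(logits.shape)  # torch.Size([4, 3])
```

Multiplicative sigmoid gating is only one common reading of "feature masking"; because both masks are produced by learnable layers and multiplied into the embedding, gradients from the transformer's emotion loss flow back through the gate, which is what the joint front-end/back-end training described above requires.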
