AN AUDIO-SALIENCY MASKING TRANSFORMER FOR AUDIO EMOTION CLASSIFICATION IN MOVIES
Ya-Tse Wu, Jeng-Lin Li, Chi-Chun Lee
The human process from perception to affective response is gated by a bottom-up saliency mechanism at the sensory level. Specifically, auditory saliency emphasizes the audio segments that need to be attended to in order to cognitively appraise and experience emotion. Inspired by this mechanism, we propose an end-to-end feature masking network for audio emotion recognition in movies. Our proposed Audio-Saliency Masking Transformer (ASTM) adjusts the feature embedding using two learnable masks: one cross-refers to an auditory saliency map, and the other is derived through self-reference. By jointly training the front-end mask gating and the back-end transformer emotion classifier, we achieve three-class UAR improvements of 1.74%, 1.27%, 0.95%, and 0.82% over the best competing models on experienced arousal, experienced valence, intended arousal, and intended valence, respectively. We further analyze which acoustic feature categories our saliency mask attends to the most.
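The dual-mask gating described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; all module names, shapes, and hyperparameters (e.g., DualMaskGate, d_model, the sigmoid gating, and mean pooling) are assumptions for illustration. The sketch only captures the stated structure: one learnable mask conditioned on an auditory saliency map, one conditioned on the features themselves, and a transformer classifier trained jointly with the gates.

```python
import torch
import torch.nn as nn

class DualMaskGate(nn.Module):
    """Hypothetical sketch of the two learnable feature masks: one
    cross-referencing an auditory saliency map, one self-referencing
    the feature embedding itself."""
    def __init__(self, d_model: int):
        super().__init__()
        # Cross-reference mask: map the 1-D saliency map to per-dimension gates.
        self.saliency_proj = nn.Linear(1, d_model)
        # Self-reference mask: gates computed from the feature embedding.
        self.self_proj = nn.Linear(d_model, d_model)

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model) acoustic feature embeddings
        # saliency: (batch, time, 1) auditory saliency map
        cross_mask = torch.sigmoid(self.saliency_proj(saliency))  # saliency-gated
        self_mask = torch.sigmoid(self.self_proj(feats))          # self-gated
        return feats * cross_mask * self_mask

class MaskedEmotionClassifier(nn.Module):
    """Front-end mask gating followed by a transformer encoder and a
    three-class emotion head, trained jointly end to end."""
    def __init__(self, d_model: int = 128, n_classes: int = 3):
        super().__init__()
        self.gate = DualMaskGate(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        x = self.gate(feats, saliency)       # front-end mask gating
        x = self.encoder(x)                  # back-end transformer
        return self.head(x.mean(dim=1))      # pool over time, classify

# Example: a batch of 2 clips, 100 frames, 128-dim features.
model = MaskedEmotionClassifier()
logits = model(torch.randn(2, 100, 128), torch.rand(2, 100, 1))  # (2, 3)
```

Because the gates are differentiable, a standard cross-entropy loss on the logits back-propagates through both masks, which is what permits the joint front-end/back-end training the abstract describes.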