AN AUDIO-SALIENCY MASKING TRANSFORMER FOR AUDIO EMOTION CLASSIFICATION IN MOVIES
Ya-Tse Wu, Jeng-Lin Li, Chi-Chun Lee
The human process from perception to affective response is gated by a bottom-up saliency mechanism at the sensory level. Specifically, auditory saliency emphasizes the audio segments that need to be attended to in order to cognitively appraise and experience emotion. Inspired by this mechanism, we propose an end-to-end feature masking network for audio emotion recognition in movies. Our proposed Audio-Saliency Masking Transformer (ASTM) adjusts the feature embedding using two learnable masks: one cross-refers to an auditory saliency map, and the other is derived through self-reference. By jointly training the front-end mask gating and the back-end transformer emotion classifier, we achieve three-class UAR improvements of 1.74%, 1.27%, 0.95%, and 0.82% over the best competing models on experienced arousal, experienced valence, intended arousal, and intended valence, respectively. We further analyze which acoustic feature categories our saliency mask attends to the most.
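The dual-mask gating described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; all module names, shapes, and hyperparameters (e.g., DualMaskGate, d_model, the sigmoid gating, and mean pooling) are assumptions for illustration. The sketch only captures the stated structure: one learnable mask conditioned on an auditory saliency map, one conditioned on the features themselves, and a transformer classifier trained jointly with the gates.

```python
import torch
import torch.nn as nn

class DualMaskGate(nn.Module):
    """Hypothetical sketch of the two learnable feature masks: one
    cross-referencing an auditory saliency map, one self-referencing
    the feature embedding itself."""
    def __init__(self, d_model: int):
        super().__init__()
        # Cross-reference mask: map the 1-D saliency map to per-dimension gates.
        self.saliency_proj = nn.Linear(1, d_model)
        # Self-reference mask: gates computed from the feature embedding.
        self.self_proj = nn.Linear(d_model, d_model)

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model) acoustic feature embeddings
        # saliency: (batch, time, 1) auditory saliency map
        cross_mask = torch.sigmoid(self.saliency_proj(saliency))  # saliency-gated
        self_mask = torch.sigmoid(self.self_proj(feats))          # self-gated
        return feats * cross_mask * self_mask

class MaskedEmotionClassifier(nn.Module):
    """Front-end mask gating followed by a transformer encoder and a
    three-class emotion head, trained jointly end to end."""
    def __init__(self, d_model: int = 128, n_classes: int = 3):
        super().__init__()
        self.gate = DualMaskGate(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        x = self.gate(feats, saliency)       # front-end mask gating
        x = self.encoder(x)                  # back-end transformer
        return self.head(x.mean(dim=1))      # pool over time, classify

# Example: a batch of 2 clips, 100 frames, 128-dim features.
model = MaskedEmotionClassifier()
logits = model(torch.randn(2, 100, 128), torch.rand(2, 100, 1))  # (2, 3)
```

Because the gates are differentiable, a standard cross-entropy loss on the logits back-propagates through both masks, which is what permits the joint front-end/back-end training the abstract describes.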