MIXED IN TIME AND MODALITY: CURSE OR BLESSING? CROSS-INSTANCE DATA AUGMENTATION FOR WEAKLY SUPERVISED MULTIMODAL TEMPORAL FUSION
Yonggang Zhu, Chao Tian, Zhuqing Jiang, Aidong Men, Haiying Wang, Qingchao Chen
In multimodal video event localization, we usually leverage feature fusion across different axes, such as the modality and temporal axes, for better context. To reduce the cost of detailed annotations, recent solutions explore weakly supervised settings. However, we observe that problems can arise when feature fusion meets weakly supervised localization: it may cause "feature cross-interference", which produces a smearing effect on the localization result and cannot be effectively supervised with a conventional multiple instance learning (MIL) loss. We verify this effect quantitatively on the audio-visual video parsing (AVVP) task and propose a cross-instance data augmentation framework that preserves the benefits of feature fusion while providing explicit feedback on feature cross-interference. We show that our method improves the performance of existing models on two weakly supervised audio-visual localization tasks, i.e., AVVP and audio-visual event localization (AVE).
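For concreteness, below is a minimal PyTorch sketch of the two ingredients the abstract names: a conventional MIL loss over weak video-level labels, and a cross-instance augmentation that swaps one modality between two videos. All names (mil_loss, mix_cross_instance) and the label-union rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch, not the paper's released code.
import torch
import torch.nn.functional as F


def mil_loss(segment_logits: torch.Tensor, video_labels: torch.Tensor) -> torch.Tensor:
    """Conventional MIL loss: pool per-segment logits (B, T, C) into a
    video-level prediction and supervise it with multi-hot video-level
    labels (B, C)."""
    video_logits = segment_logits.mean(dim=1)  # temporal pooling
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)


def mix_cross_instance(audio_a, labels_a, visual_b, labels_b):
    """Build a synthetic instance from video A's audio track and video B's
    visual track. The label-union rule below is one plausible (assumed)
    choice: every event in the mix comes from exactly one source modality,
    so a model whose fused features smear an event into the wrong modality
    can now receive an explicit penalty."""
    mixed_labels = torch.clamp(labels_a + labels_b, max=1.0)
    return audio_a, visual_b, mixed_labels


if __name__ == "__main__":
    B, T, D, C = 2, 10, 128, 25  # batch, segments, feature dim, classes
    audio_a, visual_b = torch.randn(B, T, D), torch.randn(B, T, D)
    labels_a = (torch.rand(B, C) > 0.9).float()
    labels_b = (torch.rand(B, C) > 0.9).float()

    audio, visual, labels = mix_cross_instance(audio_a, labels_a, visual_b, labels_b)

    # Toy fusion head: concatenate modalities and classify each segment.
    head = torch.nn.Linear(2 * D, C)
    segment_logits = head(torch.cat([audio, visual], dim=-1))
    print(mil_loss(segment_logits, labels))
```

The sketch illustrates why the MIL loss alone is insufficient: it sees only the pooled video-level prediction, so cross-interference between fused audio and visual features is invisible to it, whereas the mixed instance makes each event's modality of origin unambiguous.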