MIXED IN TIME AND MODALITY: CURSE OR BLESSING? CROSS-INSTANCE DATA AUGMENTATION FOR WEAKLY SUPERVISED MULTIMODAL TEMPORAL FUSION
Yonggang Zhu, Chao Tian, Zhuqing Jiang, Aidong Men, Haiying Wang, Qingchao Chen
In multimodal video event localization, we usually leverage feature fusion across different axes, such as the modality and temporal axes, for better context. To reduce the cost of detailed annotations, recent solutions explore weakly supervised settings. However, we observe that problems can arise when feature fusion meets weakly supervised localization: it may cause "feature cross-interference", which produces a smearing effect on the localization result and cannot be effectively supervised with a conventional multiple instance learning (MIL) loss. We verify this effect quantitatively on the audio-visual video parsing (AVVP) task and propose a cross-instance data augmentation framework that preserves the benefits of feature fusion while providing explicit feedback on feature cross-interference. We show that our method improves the performance of existing models on two weakly supervised audio-visual localization tasks, i.e., AVVP and audio-visual event localization (AVE).
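For concreteness, below is a minimal PyTorch sketch of the two ingredients the abstract names: a conventional MIL loss over weak video-level labels, and a cross-instance augmentation that swaps one modality between two videos. All names (mil_loss, mix_cross_instance) and the label-union rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch, not the paper's released code.
import torch
import torch.nn.functional as F


def mil_loss(segment_logits: torch.Tensor, video_labels: torch.Tensor) -> torch.Tensor:
    """Conventional MIL loss: pool per-segment logits (B, T, C) into a
    video-level prediction and supervise it with multi-hot video-level
    labels (B, C)."""
    video_logits = segment_logits.mean(dim=1)  # temporal pooling
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)


def mix_cross_instance(audio_a, labels_a, visual_b, labels_b):
    """Build a synthetic instance from video A's audio track and video B's
    visual track. The label-union rule below is one plausible (assumed)
    choice: every event in the mix comes from exactly one source modality,
    so a model whose fused features smear an event into the wrong modality
    can now receive an explicit penalty."""
    mixed_labels = torch.clamp(labels_a + labels_b, max=1.0)
    return audio_a, visual_b, mixed_labels


if __name__ == "__main__":
    B, T, D, C = 2, 10, 128, 25  # batch, segments, feature dim, classes
    audio_a, visual_b = torch.randn(B, T, D), torch.randn(B, T, D)
    labels_a = (torch.rand(B, C) > 0.9).float()
    labels_b = (torch.rand(B, C) > 0.9).float()

    audio, visual, labels = mix_cross_instance(audio_a, labels_a, visual_b, labels_b)

    # Toy fusion head: concatenate modalities and classify each segment.
    head = torch.nn.Linear(2 * D, C)
    segment_logits = head(torch.cat([audio, visual], dim=-1))
    print(mil_loss(segment_logits, labels))
```

The sketch illustrates why the MIL loss alone is insufficient: it sees only the pooled video-level prediction, so cross-interference between fused audio and visual features is invisible to it, whereas the mixed instance makes each event's modality of origin unambiguous.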