  • SPS
    Length: 00:07:21
13 May 2022

In multimodal video event localization, features are usually fused across different axes, such as the modality and temporal axes, to provide better context. To reduce the cost of detailed annotations, recent solutions explore weakly supervised settings. However, we observe that problems can occur when feature fusion meets weakly supervised localization: fusion may cause "feature cross-interference", which produces a smearing effect on the localization result and cannot be effectively supervised with a conventional multiple instance learning loss. We verify this quantitatively on the audio-visual video parsing (AVVP) task and propose a cross-instance data-augmentation framework that preserves the benefits of feature fusion while providing explicit feedback on feature cross-interference. We show that our method enhances the performance of existing models on two weakly supervised audio-visual localization tasks, namely AVVP and AVE.
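To illustrate why a video-level multiple instance learning (MIL) loss may fail to penalize smearing, here is a minimal sketch. It is not the authors' implementation. It assumes max pooling over segments and a binary cross-entropy loss on the pooled score, and the function names and segment probabilities are hypothetical. Models for these tasks may use other pooling schemes, such as attention-weighted pooling.

```python
import math

def mil_video_prob(segment_probs):
    # Assumed pooling: the video-level score is the maximum segment score.
    return max(segment_probs)

def bce(p, y, eps=1e-7):
    # Binary cross-entropy for a single probability p and label y in {0, 1}.
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mil_loss(segment_probs, video_label):
    # Weak supervision: only the video-level label is available.
    return bce(mil_video_prob(segment_probs), video_label)

# Hypothetical per-segment probabilities for one event class over 5 segments.
sharp = [0.05, 0.05, 0.90, 0.05, 0.05]    # event confined to the middle segment
smeared = [0.05, 0.60, 0.90, 0.60, 0.05]  # neighbouring segments pulled up

loss_sharp = mil_loss(sharp, 1)
loss_smeared = mil_loss(smeared, 1)
print(loss_sharp == loss_smeared)
```

Under these assumptions both predictions pool to the same video-level score, so the loss cannot distinguish the sharp localization from the smeared one. This is the kind of gap that explicit feedback on cross-interference is meant to close.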
