NAVIGATING AUDIO-VISUAL EVENT DETECTION ACROSS MISMATCHED MODALITIES
Guangwei Li, Xuenan Xu, Mengyue Wu, Kai Yu
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:12:45
Previous work on audio-visual (AV) alignment has mainly focused on frame-level synchronization while neglecting clip-wise matching. We focus on AV parsing of fully unconstrained data, where audio and visual events do not necessarily co-occur. We provide a video-enhanced AudioSet dataset covering 376 events to investigate parsing in this mismatched setting. To our knowledge, this is the first time that AV event parsing and detection have been examined in a clip-wise matching scenario. Experiments show that our proposed method substantially improves video parsing accuracy on both tagging and detection. Furthermore, a parsing model pretrained on our dataset can help accurately locate audio-visual synchronized time spans.