22 Sep 2021

In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to exploit temporal information and a spatial feature pyramid module to fuse spatial information; a multi-scale audio-visual fusion method then integrates the two modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation, consisting of 202 high-quality video sequences covering a wide range of motions, scenes, and object types; many of the videos exhibit strong audio-visual correspondence. Experiments on several datasets demonstrate that our model outperforms previous state-of-the-art methods by a large margin and that the proposed dataset can serve as a new benchmark for the audio-visual saliency estimation task.
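To make the two-branch pipeline described in the abstract concrete, here is a minimal PyTorch sketch of an audio-visual saliency model with a simple temporal attention step. All module names, layer sizes, and the single fusion step are illustrative assumptions: the abstract does not specify architectural details, and the paper's spatial feature pyramid and multi-scale fusion are collapsed here into one fusion layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Hypothetical temporal attention: learns a per-frame score and
    aggregates frame features as an attention-weighted temporal sum."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)  # per-frame score map

    def forward(self, x):  # x: (B, C, T, H, W)
        w = torch.softmax(self.score(x), dim=2)  # weights over the T axis
        return (x * w).sum(dim=2)                # (B, C, H, W)

class AudioVisualSaliency(nn.Module):
    """Two-branch audio-visual saliency sketch (not the authors'
    exact architecture): separate visual and audio encoders,
    temporal attention on the visual branch, one fusion step."""
    def __init__(self, vis_ch=64, aud_ch=64):
        super().__init__()
        self.visual = nn.Conv3d(3, vis_ch, kernel_size=3, padding=1)
        self.audio = nn.Sequential(
            nn.Conv1d(1, aud_ch, kernel_size=9, padding=4),
            nn.AdaptiveAvgPool1d(1),  # pool the waveform to one vector
        )
        self.temporal_attn = TemporalAttention(vis_ch)
        self.fuse = nn.Conv2d(vis_ch + aud_ch, vis_ch, kernel_size=3, padding=1)
        self.head = nn.Conv2d(vis_ch, 1, kernel_size=1)

    def forward(self, frames, waveform):
        # frames: (B, 3, T, H, W); waveform: (B, 1, L)
        v = F.relu(self.visual(frames))
        v = self.temporal_attn(v)             # (B, C, H, W)
        a = self.audio(waveform).squeeze(-1)  # (B, C)
        # Broadcast the audio vector across the spatial grid, then fuse.
        a = a[:, :, None, None].expand(-1, -1, *v.shape[-2:])
        fused = F.relu(self.fuse(torch.cat([v, a], dim=1)))
        return torch.sigmoid(self.head(fused))  # saliency map: (B, 1, H, W)

model = AudioVisualSaliency()
sal = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 1, 16000))
print(sal.shape)  # torch.Size([1, 1, 64, 64])
```

In this sketch the audio branch contributes a single global vector per clip; a pyramid-style variant would instead fuse audio with visual features at several spatial resolutions, as the abstract's multi-scale fusion suggests.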
