Deep Audio-Visual Fusion Neural Network For Saliency Estimation
Shunyu Yao, Xiongkuo Min, Guangtao Zhai
In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to exploit temporal information and a spatial feature pyramid module to fuse spatial information. A multi-scale audio-visual fusion method then integrates the two modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation, consisting of 202 high-quality video sequences that cover a wide range of motions, scenes, and object types; many of the videos exhibit strong audio-visual correspondence. Experiments on several datasets demonstrate that our model outperforms previous state-of-the-art methods by a large margin and that the proposed dataset can serve as a new benchmark for the audio-visual saliency estimation task.
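The abstract does not spell out the architecture, but a minimal sketch of the two-branch fusion idea it describes (separate visual and audio encoders, temporal attention over frames, and late fusion into a single saliency map) might look like the following. The module names, layer sizes, and fusion-by-concatenation strategy are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal attention: weight frames, then collapse time."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x):                       # x: (B, C, T, H, W)
        w = torch.sigmoid(self.score(x))        # per-frame, per-location weights
        return (x * w).sum(dim=2)               # collapse time -> (B, C, H, W)

class AudioVisualSaliency(nn.Module):
    """Two-branch sketch: visual and audio encoders, late fusion, saliency decoder.
    Layer choices are placeholders, not the paper's network."""
    def __init__(self, vis_channels=64, aud_channels=64):
        super().__init__()
        # Visual branch: small 3D conv stack over a frame clip.
        self.visual = nn.Sequential(
            nn.Conv3d(3, vis_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(vis_channels, vis_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.temporal_attn = TemporalAttention(vis_channels)
        # Audio branch: 2D convs over a log-mel spectrogram, pooled to a vector.
        self.audio = nn.Sequential(
            nn.Conv2d(1, aud_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion + decoder: audio descriptor is tiled over space,
        # concatenated with visual features, and reduced to one channel.
        self.decoder = nn.Sequential(
            nn.Conv2d(vis_channels + aud_channels, vis_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(vis_channels, 1, kernel_size=1),
        )

    def forward(self, frames, spectrogram):
        # frames: (B, 3, T, H, W); spectrogram: (B, 1, mel_bins, time)
        v = self.temporal_attn(self.visual(frames))        # (B, Cv, H, W)
        a = self.audio(spectrogram)                        # (B, Ca, 1, 1)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])       # broadcast over space
        fused = torch.cat([v, a], dim=1)
        return torch.sigmoid(self.decoder(fused))          # saliency map in [0, 1]

# Usage: a 16-frame 112x112 clip paired with a 64x100 log-mel spectrogram.
model = AudioVisualSaliency()
frames = torch.randn(2, 3, 16, 112, 112)
spec = torch.randn(2, 1, 64, 100)
print(model(frames, spec).shape)                # torch.Size([2, 1, 112, 112])
```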