MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
Wujiang Xu (Xi'an Jiaotong University); Runzhong Wang (Shanghai Jiao Tong University); Xiaobo Guo (Ant Group); Shaoshuai Li (Ant Group); Qiongxu Ma (Ant Group); Yunan Zhao (Ant Group); Sheng Guo (Ant Group); Zhenfeng Zhu (Beijing Jiaotong University); Junchi Yan (Shanghai Jiao Tong University)
SPS
Video summarization is an essential problem in signal processing that aims to produce a concise summary of the original video. Existing approaches regard the task as a keyframe selection problem and generally construct the frame-wise representation by combining long-range temporal dependency with either unimodal or bimodal information. However, optimal keyframes should semantically summarize the whole content by exploiting both the multimodal and the shot-level hierarchical nature of videos, and existing methods do not fully exploit these properties. In this paper, we propose to construct a more powerful and robust frame-wise representation and to predict the frame-level importance score in a fair and comprehensive manner. Specifically, we propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, which enhances the frame-wise representation by combining all available multimodal information. We further design a hierarchical ShotConv network that produces an adaptive shot-aware frame-level representation accounting for both short-range and long-range temporal dependencies. Based on the learned shot-aware representations, MHSCNet predicts the frame-level importance score from both the local and the global view of the video. Extensive experiments on two standard video summarization datasets demonstrate that our proposed method consistently outperforms state-of-the-art methods.
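The pipeline the abstract describes (multimodal frame-wise features, shot-aware context at short and long range, then per-frame importance scoring) can be illustrated with a minimal numpy sketch. All names, dimensions, the fixed shot length, and the mean-pooling/linear-head choices below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: per-frame features from two modalities
# (dimensions are illustrative, not from the paper).
T, d_vis, d_txt = 12, 8, 4        # 12 frames in the toy video
visual = rng.standard_normal((T, d_vis))
caption = rng.standard_normal((T, d_txt))

# 1) Multimodal frame-wise representation: here, plain concatenation
#    stands in for the paper's learned fusion.
frames = np.concatenate([visual, caption], axis=1)   # (T, d_vis + d_txt)

# 2) Shot-aware context: fixed-length shots of 4 frames; each frame is
#    augmented with its shot mean (short-range dependency) and the
#    whole-video mean (long-range dependency).
shot_len = 4
shot_mean = frames.reshape(T // shot_len, shot_len, -1).mean(axis=1)
shot_ctx = np.repeat(shot_mean, shot_len, axis=0)    # (T, d)
video_ctx = np.broadcast_to(frames.mean(axis=0), frames.shape)

rep = np.concatenate([frames, shot_ctx, video_ctx], axis=1)

# 3) Frame-level importance: a random linear head with a sigmoid,
#    standing in for the learned predictor over local and global views.
w = rng.standard_normal(rep.shape[1])
scores = 1.0 / (1.0 + np.exp(-(rep @ w)))            # one score per frame
```

A summary would then keep the highest-scoring frames (or shots) up to a length budget; the actual model learns the fusion, the shot boundaries, and the scoring head end-to-end rather than using fixed pooling as above.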