MULTI-MODAL HIERARCHICAL ATTENTION-BASED DENSE VIDEO CAPTIONING
Hemalatha Munusamy, Chandra Sekhar C
Most existing dense video captioning models rely on a single modality of features. A video, however, carries a wide variety of information, including spatial, temporal, audio, and semantic features. In this paper, we propose a dense video captioning model that captures cross-modal attention between different types of features using an audio-visual attention block in the encoder and a hierarchical attention block in the decoder. The audio-visual attention block applies cross-modal attention between the RGB, flow, and audio features. The hierarchical attention block performs two-level attention between the semantic features and the encoder features to generate descriptions. The results show that the proposed approach performs better than state-of-the-art approaches.
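The two attention blocks described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give layer definitions, so the scaled dot-product form, the residual addition, the concatenation of the other modalities as context, the feature dimension, and all function names (`attend`, `audio_visual_block`, `hierarchical_attention`) are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, context):
    # Scaled dot-product attention: each query row attends over context rows.
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return weights @ context, weights

def audio_visual_block(rgb, flow, audio):
    # Assumed form of cross-modal attention: each modality queries the
    # concatenation of the other two, and the result is added residually.
    feats = {"rgb": rgb, "flow": flow, "audio": audio}
    out = {}
    for name, q in feats.items():
        others = np.concatenate(
            [v for k, v in feats.items() if k != name], axis=0
        )
        attended, _ = attend(q, others)
        out[name] = q + attended
    return out

def hierarchical_attention(state, semantic, encoded):
    # Assumed two-level scheme.
    # Level 1: the decoder state attends within each source separately.
    sem_ctx, _ = attend(state, semantic)
    enc_ctx, _ = attend(state, encoded)
    # Level 2: the decoder state attends across the two level-1 summaries.
    sources = np.concatenate([sem_ctx, enc_ctx], axis=0)
    fused, level2_weights = attend(state, sources)
    return fused, level2_weights

# Toy inputs: random features with a hypothetical shared dimension of 8.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(5, 8))       # 5 RGB frame features
flow = rng.normal(size=(5, 8))      # 5 optical-flow features
audio = rng.normal(size=(7, 8))     # 7 audio segment features
semantic = rng.normal(size=(4, 8))  # 4 semantic concept embeddings
state = rng.normal(size=(1, 8))     # current decoder state

encoded = audio_visual_block(rgb, flow, audio)
encoder_out = np.concatenate(list(encoded.values()), axis=0)
fused, level2_weights = hierarchical_attention(state, semantic, encoder_out)
```

In this sketch, `level2_weights` holds two values that sum to one, indicating how much the fused context draws on the semantic features versus the encoder features at a given decoding step. A trained model would use learned projections for queries, keys, and values, which are omitted here.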