  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
Lecture 09 Oct 2023

Most existing dense video captioning models use a single modality of features for captioning. A video, however, carries a wide variety of information: spatial, temporal, audio, and semantic features. In this paper, we propose a dense video captioning model that captures cross-modal attention between different types of features, using an audio-visual attention block in the encoder and a hierarchical attention block in the decoder. The audio-visual attention block applies cross-modal attention between the RGB, flow, and audio features. The hierarchical attention block performs two-level attention between the semantic features and the encoder features to generate descriptions. The results show that the proposed approach outperforms state-of-the-art approaches.
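The abstract does not specify the attention mechanism; the following is a minimal sketch of the cross-modal attention idea, assuming standard scaled dot-product attention where one modality (e.g., RGB) queries another (e.g., audio). All function names, feature dimensions, and sequence lengths here are hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, key_feats):
    """Scaled dot-product attention where one modality (query)
    attends over another modality (key/value).

    query_feats: (n_q, d) features of the querying modality
    key_feats:   (n_k, d) features of the attended modality
    Returns the attended features (n_q, d) and the weights (n_q, n_k).
    """
    d = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d)   # (n_q, n_k)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ key_feats, weights

# Toy example: 4 RGB frame features attending over 6 audio frame
# features, embedding dim 8 (all sizes chosen for illustration only).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
audio = rng.standard_normal((6, 8))

audio_attended_rgb, w = cross_modal_attention(rgb, audio)
```

In an audio-visual attention block of this kind, each modality (RGB, flow, audio) would play the query role in turn against the others, and the attended outputs would be fused before being passed to the decoder.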
