Poster 11 Oct 2023

People navigate a world that involves many different modalities and make decisions based on what they observe. Many classification problems in the modern digital world are also multimodal in nature: textual information on the web rarely occurs alone and is often accompanied by images, sounds, or videos. Transformers have proven highly effective in deep learning tasks. However, the relationship between different modalities remains unclear. This paper investigates ways to apply self-attention simultaneously over the text and vision modalities. We propose a novel architecture that combines the strengths of both modalities. We show that combining a text model with a fixed image model leads to the best classification performance. Additionally, we incorporate a late fusion technique to enhance the architecture's ability to capture multiple modalities. Our experiments demonstrate that the proposed method outperforms state-of-the-art baselines on the Food101, MM-IMDB, and FashionGen datasets.
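
The abstract does not say how the late fusion step is computed, so the following is only a minimal sketch of one common form of late fusion: each modality produces its own class logits, and the per-modality probabilities are averaged with a weight. The function names (`softmax`, `late_fusion`, `predict`), the `text_weight` parameter, and the example logits are all hypothetical and are not taken from the paper. The logits stand in for the outputs of a text model and a fixed image model (assumed here to mean an image model whose weights are not updated).

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_logits, image_logits, text_weight=0.5):
    """Weighted average of per-modality class probabilities.

    Assumption: this weighted-average rule is an illustrative choice,
    not the fusion rule used in the paper.
    """
    if len(text_logits) != len(image_logits):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    p_text = softmax(text_logits)
    p_image = softmax(image_logits)
    return [
        text_weight * pt + (1.0 - text_weight) * pi
        for pt, pi in zip(p_text, p_image)
    ]


def predict(text_logits, image_logits, text_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(text_logits, image_logits, text_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical logits for a three-class problem.
text_logits = [2.0, 0.0, 0.0]   # stand-in for the text model's output
image_logits = [0.0, 0.0, 1.0]  # stand-in for the fixed image model's output
fused = late_fusion(text_logits, image_logits)
```

Because fusion happens on the outputs rather than inside the networks, the image model can stay untouched while only the text side (and, if desired, the weight) is tuned. Setting `text_weight` to 1.0 or 0.0 recovers the text-only or image-only prediction.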
