
Compression of User Generated Content Using Denoised References

Eduardo Pavez, Enrique Perez, Xin Xiong, Antonio Ortega, Balu Adsumilli

Length: 00:15:38
19 Oct 2022

Multi-modal learning with both text and images benefits many applications, such as attribute extraction for e-commerce products. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new multi-modal architecture that jointly learns fine-grained inter-modality relationships. It fuses CLIP with a sequence-wise attention module and a modality-wise attention module. The network uses CLIP to bridge the inter-modality gap at the global level and uses the sequence-wise attention module to capture the fine-grained alignment between text and images. In addition, it leverages a modality-wise attention module to learn the relevance of each modality to downstream tasks, making the network robust against irrelevant modalities. CMA-CLIP outperforms the state-of-the-art method on Fashion-Gen by 5.5% in accuracy, achieves competitive performance on Food101, and performs on par with the state-of-the-art method on MM-IMDb. We also demonstrate CMA-CLIP's robustness against irrelevant modalities on an Amazon dataset for the task of product attribute extraction.
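
The abstract outlines three components: a CLIP backbone for global text-image alignment, a sequence-wise attention module for fine-grained alignment between text tokens and image patches, and a modality-wise attention module that weights each modality's relevance to the downstream task. The PyTorch sketch below illustrates one way such modules could be wired together; the CLIP interface (`encode_text_tokens`, `encode_image_patches`), feature widths, and pooling choices are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the CMA-CLIP idea described in the abstract (not the authors' code).
# Assumes a CLIP-style backbone that returns per-token text features and per-patch
# image features of a shared width `dim`; names and shapes are illustrative only.
import torch
import torch.nn as nn


class SequenceWiseAttention(nn.Module):
    """Cross-attends text tokens over image patches to model fine-grained alignment."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens act as queries against image patches; pool the fused sequence.
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return fused.mean(dim=1)


class ModalityWiseAttention(nn.Module):
    """Learns a relevance weight per modality so irrelevant inputs are down-weighted."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, num_modalities, dim)
        weights = torch.softmax(self.score(modality_feats), dim=1)
        return (weights * modality_feats).sum(dim=1)


class CMACLIPSketch(nn.Module):
    def __init__(self, clip_model: nn.Module, dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.clip = clip_model           # CLIP backbone (assumed interface below)
        self.seq_attn = SequenceWiseAttention(dim)
        self.mod_attn = ModalityWiseAttention(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_tokens: torch.Tensor, image_pixels: torch.Tensor) -> torch.Tensor:
        # Hypothetical CLIP methods returning sequence-level features per modality.
        text_feats = self.clip.encode_text_tokens(text_tokens)      # (B, L, dim)
        image_feats = self.clip.encode_image_patches(image_pixels)  # (B, P, dim)
        fused = self.seq_attn(text_feats, image_feats)               # (B, dim)
        # Treat global text, global image, and fused features as three "modalities".
        modalities = torch.stack(
            [text_feats.mean(dim=1), image_feats.mean(dim=1), fused], dim=1
        )
        pooled = self.mod_attn(modalities)                           # (B, dim)
        return self.head(pooled)
```

In this sketch, the modality-wise softmax lets the classifier head rely mostly on whichever of the global text, global image, or fused representation is informative, which is one plausible way to realize the robustness to irrelevant modalities claimed in the abstract.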
