
Compression of User Generated Content Using Denoised References

Eduardo Pavez, Enrique Perez, Xin Xiong, Antonio Ortega, Balu Adsumilli

Length: 00:15:38
19 Oct 2022

Multi-modal learning with both text and images benefits many applications, such as attribute extraction for e-commerce products. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new multi-modal architecture that jointly learns fine-grained inter-modality relationships. It fuses CLIP with a sequence-wise attention module and a modality-wise attention module. The network uses CLIP to bridge the inter-modality gap at the global level and uses the sequence-wise attention module to capture the fine-grained alignment between text and images. In addition, it leverages a modality-wise attention module to learn the relevance of each modality to downstream tasks, making the network robust against irrelevant modalities. CMA-CLIP outperforms the state-of-the-art method on Fashion-Gen by 5.5% in accuracy, achieves competitive performance on Food101, and performs on par with the state-of-the-art method on MM-IMDb. We also demonstrate CMA-CLIP's robustness against irrelevant modalities on an Amazon dataset for the task of product attribute extraction.
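
The abstract outlines three components: a CLIP backbone for global text-image alignment, a sequence-wise attention module for fine-grained alignment between text tokens and image patches, and a modality-wise attention module that weights each modality's relevance to the downstream task. The PyTorch sketch below illustrates one way such modules could be wired together; the CLIP interface (`encode_text_tokens`, `encode_image_patches`), feature widths, and pooling choices are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the CMA-CLIP idea described in the abstract (not the authors' code).
# Assumes a CLIP-style backbone that returns per-token text features and per-patch
# image features of a shared width `dim`; names and shapes are illustrative only.
import torch
import torch.nn as nn


class SequenceWiseAttention(nn.Module):
    """Cross-attends text tokens over image patches to model fine-grained alignment."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens act as queries against image patches; pool the fused sequence.
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return fused.mean(dim=1)


class ModalityWiseAttention(nn.Module):
    """Learns a relevance weight per modality so irrelevant inputs are down-weighted."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, num_modalities, dim)
        weights = torch.softmax(self.score(modality_feats), dim=1)
        return (weights * modality_feats).sum(dim=1)


class CMACLIPSketch(nn.Module):
    def __init__(self, clip_model: nn.Module, dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.clip = clip_model           # CLIP backbone (assumed interface below)
        self.seq_attn = SequenceWiseAttention(dim)
        self.mod_attn = ModalityWiseAttention(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_tokens: torch.Tensor, image_pixels: torch.Tensor) -> torch.Tensor:
        # Hypothetical CLIP methods returning sequence-level features per modality.
        text_feats = self.clip.encode_text_tokens(text_tokens)      # (B, L, dim)
        image_feats = self.clip.encode_image_patches(image_pixels)  # (B, P, dim)
        fused = self.seq_attn(text_feats, image_feats)               # (B, dim)
        # Treat global text, global image, and fused features as three "modalities".
        modalities = torch.stack(
            [text_feats.mean(dim=1), image_feats.mean(dim=1), fused], dim=1
        )
        pooled = self.mod_attn(modalities)                           # (B, dim)
        return self.head(pooled)
```

In this sketch, the modality-wise softmax lets the classifier head rely mostly on whichever of the global text, global image, or fused representation is informative, which is one plausible way to realize the robustness to irrelevant modalities claimed in the abstract.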
