Lecture 09 Oct 2023

Self-supervised learning with the vision transformer (ViT) has gained much attention recently. Most existing methods rely on either contrastive learning or masked image modeling. The former is suited to global feature extraction but underperforms on fine-grained tasks; the latter explores the internal structure of images but ignores the high information sparsity and unbalanced information distribution across image regions. In this paper, we propose a new approach called Attention-guided Contrastive Masked Image Modeling (ACoMIM), which integrates the merits of both paradigms and leverages the attention mechanism of ViT for effective representation learning. Specifically, it has two pretext tasks: predicting the features of masked regions under attention guidance, and contrasting the global features of masked and unmasked images. We show that these two pretext tasks complement each other and improve our method's performance. Experiments demonstrate that our model transfers well to various downstream tasks such as classification and object detection.

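The abstract names the two pretext tasks but gives no implementation details. Below is a minimal sketch of how such a training objective could be wired up, assuming a PyTorch student/teacher setup; the names (ToyViT, attention_guided_mask, acomim_losses), the top-k attention mask selection, the mean-pooled global feature, and the MSE/InfoNCE loss forms are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ACoMIM-style pretext losses; details are assumptions,
# not the paper's exact method.
import torch
import torch.nn.functional as F
from torch import nn

class ToyViT(nn.Module):
    """Stand-in ViT encoder: returns per-patch tokens and a mean-pooled global feature."""
    def __init__(self, num_patches=196, dim=192, depth=4, heads=3):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patch_embed, mask=None):
        x = patch_embed + self.pos
        if mask is not None:  # replace masked patches with a learnable mask token
            x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        tokens = self.blocks(x)
        return tokens, tokens.mean(dim=1)  # per-patch tokens, global feature

def attention_guided_mask(attn_scores, mask_ratio=0.5):
    """Mask the patches receiving the highest attention (one plausible selection rule)."""
    num_mask = int(attn_scores.size(1) * mask_ratio)
    idx = attn_scores.topk(num_mask, dim=1).indices
    mask = torch.zeros_like(attn_scores, dtype=torch.bool)
    return mask.scatter(1, idx, torch.ones_like(idx, dtype=torch.bool))

def acomim_losses(student, teacher, patch_embed, attn_scores, temperature=0.2):
    mask = attention_guided_mask(attn_scores)             # (B, N) boolean mask
    with torch.no_grad():                                 # teacher sees the full image
        t_tokens, t_global = teacher(patch_embed)
    s_tokens, s_global = student(patch_embed, mask=mask)  # student sees masked input

    # Task 1: predict teacher features at the attention-selected masked positions.
    recon = F.mse_loss(s_tokens[mask], t_tokens[mask])

    # Task 2: contrast global features of masked vs. unmasked views (batch-wise InfoNCE).
    s = F.normalize(s_global, dim=-1)
    t = F.normalize(t_global, dim=-1)
    logits = s @ t.t() / temperature
    labels = torch.arange(s.size(0), device=s.device)
    contrast = F.cross_entropy(logits, labels)
    return recon + contrast

# Usage with random tensors standing in for patch embeddings and attention maps.
student, teacher = ToyViT(), ToyViT()
patches = torch.randn(8, 196, 192)
attn = torch.rand(8, 196)
loss = acomim_losses(student, teacher, patches, attn)
loss.backward()
```

The batch-wise InfoNCE loss and the frozen full-view teacher are used here only because they are common choices for this kind of two-view setup; the paper may define the masking strategy and contrastive objective differently.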