Lecture 09 Oct 2023

Self-supervised learning with vision transformers (ViT) has attracted much attention recently. Most existing methods rely on either contrastive learning or masked image modeling. The former is well suited to global feature extraction but underperforms on fine-grained tasks. The latter exploits the internal structure of images but ignores their high information sparsity and unbalanced information distribution. In this paper, we propose a new approach called Attention-guided Contrastive Masked Image Modeling (ACoMIM), which integrates the merits of both paradigms and leverages the attention mechanism of ViT for effective representation learning. Specifically, it has two pretext tasks: predicting the features of masked regions guided by attention, and contrasting the global features of masked and unmasked images. We show that these two pretext tasks complement each other and improve our method's performance. Experiments demonstrate that our model transfers well to various downstream tasks such as classification and object detection.
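The sketch below is not the authors' code; it is a minimal Python/PyTorch illustration of the two pretext tasks named in the abstract, assuming a generic ViT-style encoder. The module names, the mean-feature similarity used as a proxy for ViT attention, the mask ratio, and the loss weighting are all illustrative assumptions.

```python
# Illustrative sketch (not the ACoMIM implementation): attention-guided masked
# feature prediction plus a contrastive loss between masked and unmasked views.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyViTEncoder(nn.Module):
    """Stand-in ViT encoder: patch embedding followed by transformer blocks."""

    def __init__(self, img_size=32, patch=4, dim=128, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens)


def attention_guided_mask(tokens, mask_ratio=0.5):
    """Proxy for attention guidance (assumption): score each patch token by its
    similarity to the mean-pooled image feature and mask the top-scoring ones."""
    scores = F.cosine_similarity(tokens, tokens.mean(dim=1, keepdim=True), dim=-1)
    num_mask = int(tokens.size(1) * mask_ratio)
    idx = scores.topk(num_mask, dim=1).indices
    mask = torch.zeros(tokens.shape[:2], dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, idx, True)
    return mask  # (B, N) boolean, True = masked patch


def acomim_style_loss(encoder, img, temperature=0.1):
    # Target features from the unmasked image (no gradients).
    with torch.no_grad():
        target_tokens = encoder(img)

    mask = attention_guided_mask(target_tokens)

    # Student pass on the masked image; masked tokens are simply zeroed here.
    tokens = encoder.patch_embed(img).flatten(2).transpose(1, 2)
    tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    pred_tokens = encoder.blocks(tokens)

    # Pretext task 1: predict the features of the masked regions.
    recon_loss = F.mse_loss(pred_tokens[mask], target_tokens[mask])

    # Pretext task 2: contrast global features of masked vs. unmasked views.
    z_masked = F.normalize(pred_tokens.mean(dim=1), dim=-1)
    z_full = F.normalize(target_tokens.mean(dim=1), dim=-1)
    logits = z_masked @ z_full.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    contrast_loss = F.cross_entropy(logits, labels)

    return recon_loss + contrast_loss


if __name__ == "__main__":
    enc = ToyViTEncoder()
    loss = acomim_style_loss(enc, torch.randn(8, 3, 32, 32))
    loss.backward()
    print(float(loss))
```

In this toy setup the two losses are summed with equal weight; how the real method balances them, selects masked regions from actual ViT attention maps, and builds the contrastive pairs is described in the paper itself.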
