  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:14:44
07 Oct 2022

Fine-grained visual classification (FGVC) aims to accurately identify subordinate categories within a target class. Convolutional neural network (CNN) based methods have shown that attention mechanisms can enhance the representation of local regions and improve recognition accuracy. Recently, the vision transformer (ViT) has shown great potential in image classification tasks, thanks to its inherent self-attention mechanism and its ability to acquire global information early in the network. However, this global information acquisition also brings the irrelevant environment into the interaction process, which makes it difficult for fine-grained tasks that rely on local differences to quickly learn discriminative features. To this end, we propose a hybrid network termed Mask-ViT, which effectively avoids environmental interference and expresses more robust features by focusing on the instance itself. Specifically, Contour Knowledge Embedding (CKE) is employed to transfer prior location information to the ViT and guide the subsequent recognition. Experiments on three benchmarks demonstrate the effectiveness of the proposed method.
