SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:14:44
Fine-grained visual classification (FGVC) aims to accurately identify subordinate categories within a target class. Convolutional neural network (CNN) based methods have shown that attention mechanisms can enhance the representation of local regions and improve recognition accuracy. Recently, the vision transformer (ViT) has shown great potential in image classification tasks thanks to its inherent self-attention mechanism and its ability to acquire global information early. However, this global information acquisition brings the irrelevant environment into the interaction process, which makes it difficult for fine-grained tasks that rely on local differences to quickly learn discriminative features. To this end, we propose a hybrid network termed Mask-ViT, which effectively avoids environmental interference and extracts more robust features by focusing on the instance itself. Specifically, Contour Knowledge Embedding (CKE) is employed to transfer prior location information to ViT and guide the subsequent recognition. Experiments on three benchmarks demonstrate the effectiveness of the proposed method.
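
The abstract does not specify how CKE injects location information into ViT, so the following is only a minimal sketch of one generic way to keep background patches out of self-attention; it is not the authors' implementation. It assumes a boolean foreground mask per patch is already available (for example, derived from an object contour), and the function name `masked_self_attention`, the single attention head, and the identity query/key/value projections are all hypothetical simplifications chosen for illustration.

```python
import numpy as np

def masked_self_attention(tokens, foreground):
    """Single-head self-attention in which background patches are hidden as keys.

    tokens:     (n, d) array of patch embeddings.
    foreground: (n,) boolean array, True where a patch overlaps the object.
    Identity Q/K/V projections are used for brevity (an illustrative assumption).
    """
    tokens = np.asarray(tokens, dtype=float)
    foreground = np.asarray(foreground, dtype=bool)
    if not foreground.any():
        raise ValueError("at least one foreground patch is required")

    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)                   # (n, n) scaled dot products
    scores = np.where(foreground[None, :], scores, -np.inf)   # hide background keys
    scores = scores - scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)                                  # exp(-inf) == 0 for hidden keys
    weights = weights / weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ tokens, weights

# Usage: four patch tokens, of which patches 0 and 2 lie on the object.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 3))
on_object = np.array([True, False, True, False])
output, attention = masked_self_attention(patches, on_object)
```

In this sketch every token, including background ones, still receives an output, but each output is a mixture of foreground tokens only, which is one simple way to realize the idea of focusing on the instance itself.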