TAQ: TOP-K ATTENTION-AWARE QUANTIZATION FOR VISION TRANSFORMERS
Lili Shi, Haiduo Huang, Bowei Song, Meng Tan, Wenzhe Zhao, Tian Xia, Pengju Ren
Model quantization can reduce the memory footprint of neural networks and improve computing efficiency. However, the sparse attention in Transformer models is difficult to quantize: changes in the ranking of attention values and shifts of attention regions can lead to incorrect predictions. To address this problem, we propose a quantization method, termed TAQ, which uses a proposed Top-K attention-aware loss to search for the quantization parameters. Furthermore, we combine sequential and parallel quantization methods to optimize the procedure. We evaluate the generalization ability of TAQ on various vision Transformer variants, and its performance on image classification and object detection tasks. TAQ makes the Top-K attention ranking more consistent before and after quantization and significantly reduces the attention shifting rate. Compared with PTQ4ViT, TAQ improves performance by 0.66 and 0.45 on ImageNet and COCO, respectively, achieving state-of-the-art results.
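As a rough illustration of the idea (a minimal sketch, not the paper's exact formulation), the PyTorch snippet below shows one plausible form of a Top-K attention-aware loss: it penalizes quantized attention maps whose Top-K entries diverge from the full-precision Top-K attention, so that the dominant attention ranking and attended regions stay consistent after quantization. The function name, signature, and MSE matching term are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def topk_attention_loss(attn_fp: torch.Tensor,
                        attn_q: torch.Tensor,
                        k: int = 10) -> torch.Tensor:
    """Hypothetical Top-K attention-aware loss (illustrative sketch only).

    attn_fp, attn_q: attention maps of shape (..., N) from the
    full-precision and quantized models; k: number of top attention
    values whose consistency is enforced.
    """
    # Indices of the K largest full-precision attention values.
    topk_idx = attn_fp.topk(k, dim=-1).indices
    # Gather the corresponding entries from both maps and match them,
    # encouraging the quantized model to preserve the same dominant
    # attention entries (and hence the Top-K ranking and regions).
    fp_vals = attn_fp.gather(-1, topk_idx)
    q_vals = attn_q.gather(-1, topk_idx)
    return F.mse_loss(q_vals, fp_vals)
```

In a calibration-based setup, a loss of this kind could be minimized over a small calibration set when searching the quantization scales, in place of (or alongside) a plain output-distance objective.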