Differentiable Dynamic Channel Association For Knowledge Distillation
Qiankun Tang, Xiaogang Xu, Jun Wang
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:05:52
Knowledge distillation is an effective model compression technique that encourages a small student model to mimic the features or probabilistic outputs of a large teacher model. Existing feature-based distillation methods mainly focus on formulating enriched representations, but they address the channel dimension gap naively and rely on handcrafted channel association strategies between teacher and student. This not only introduces additional parameters and computational cost, but may also transfer irrelevant information to the student. In this paper, we present a differentiable and efficient Dynamic Channel Association (DCA) mechanism, which automatically associates suitable teacher channels with each student channel. DCA also enables each student channel to distill knowledge from multiple teacher channels in a weighted manner. Extensive experiments on the classification task, with various combinations of teacher and student network architectures, demonstrate the effectiveness of the proposed approach.
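
The abstract does not say how DCA computes its association weights, so the following is only a minimal sketch of the general idea of weighted, differentiable channel association, not the authors' method. It assumes a hypothetical matrix of association scores (`logits`, one row per student channel and one column per teacher channel), normalizes each row with a softmax so the weights stay differentiable, and builds each student channel's distillation target as a weighted mix of teacher channels. The function names, the softmax normalization, and the mean squared error loss are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def associate(teacher, logits):
    """Mix teacher channels into one target per student channel.

    teacher: T channels, each a flat list of N activations.
    logits:  S x T association scores (hypothetical; in a real model
             these would be learned or predicted, which the abstract
             does not specify).
    Returns (weights, targets): weights is S x T with rows summing
    to 1, targets is S x N.
    """
    weights = [softmax(row) for row in logits]
    n = len(teacher[0])
    targets = [
        [sum(w[t] * teacher[t][i] for t in range(len(teacher))) for i in range(n)]
        for w in weights
    ]
    return weights, targets


def distill_loss(student, targets):
    """Mean squared error between student channels and their targets
    (an assumed loss, chosen only for illustration)."""
    total, count = 0.0, 0
    for s_channel, t_channel in zip(student, targets):
        for a, b in zip(s_channel, t_channel):
            total += (a - b) ** 2
            count += 1
    return total / count


# Three teacher channels, two student channels, two activations each.
teacher = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
logits = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
weights, targets = associate(teacher, logits)
student = [[3.0, 4.0], [0.0, 0.0]]
loss = distill_loss(student, targets)
```

Because the weights come from a softmax rather than a hard assignment, no projection layer is needed to match channel counts, and a student channel can draw on several teacher channels at once; in an automatic differentiation framework the scores could be trained jointly with the student.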