Zero-shot Human-Object Interaction (HOI) Classification by Bridging Generative and Contrastive Image-Language Models
Ying Jin, Yinpeng Chen, Jianfeng Wang, Lijuan Wang, Jenq-Neng Hwang, Zicheng Liu
SPS
Existing studies in Human-Object Interaction (HOI) classification rely on costly human-annotated labels. This paper studies a new zero-shot setup that removes the dependency on ground-truth labels. We propose a novel Heterogeneous Teacher-Student (HTS) framework and a new loss function. HTS employs a generative pre-trained image captioner as the teacher and a contrastive pre-trained classifier as the student, combining the discriminability of generative pre-training with the efficiency of contrastive pre-training. To facilitate HOI learning in this setup, we introduce pseudo-label filtering, which aggregates HOI probabilities from multiple regional captions to supervise the student. To enhance the student's multi-label learning on few-shot classes, we design the LogSumExp (LSE)-Sign loss, which features a dynamic gradient re-weighting mechanism. The resulting student achieves 49.6 mAP on the HICO dataset without using ground truth, setting a new state of the art and outperforming supervised approaches. Code is available.
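The abstract does not give formulas, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes that regional caption probabilities are aggregated by simple averaging and thresholding, and that the LSE-Sign loss takes the form log(1 + sum_k exp(-y_k * s_k)) with signs y_k in {+1, -1}. The function names, the averaging rule, and the threshold are all hypothetical. Under this assumed form, the gradient magnitude for each class is a softmax-style weight, which is one way a loss can re-weight gradients dynamically toward poorly fit classes.

```python
import math


def aggregate_pseudo_labels(region_probs, threshold=0.5):
    """Average per-region HOI probabilities, then map to +1 / -1 signs.

    region_probs: one list of class probabilities per regional caption.
    Averaging and the threshold value are assumptions for illustration.
    """
    num_regions = len(region_probs)
    num_classes = len(region_probs[0])
    mean = [sum(row[k] for row in region_probs) / num_regions
            for k in range(num_classes)]
    return [1 if p >= threshold else -1 for p in mean]


def _signed_terms(scores, signs):
    # The leading 0.0 corresponds to the "1 +" inside the logarithm.
    return [0.0] + [-y * s for s, y in zip(scores, signs)]


def lse_sign_loss(scores, signs):
    """Assumed form: log(1 + sum_k exp(-sign_k * score_k)), computed stably."""
    terms = _signed_terms(scores, signs)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def lse_sign_grad_weights(scores, signs):
    """Magnitude of d(loss)/d(score_k) for each class under the assumed form.

    Each weight is exp(-y_k s_k) / (1 + sum_j exp(-y_j s_j)), so classes
    the student currently gets wrong receive the largest share.
    """
    terms = _signed_terms(scores, signs)
    m = max(terms)
    exps = [math.exp(t - m) for t in terms]
    total = sum(exps)
    return [e / total for e in exps[1:]]


if __name__ == "__main__":
    # Two hypothetical regional captions scored over three HOI classes.
    signs = aggregate_pseudo_labels([[0.9, 0.1, 0.4], [0.7, 0.2, 0.8]])
    student_scores = [2.0, -1.0, -2.0]
    print(signs)
    print(lse_sign_loss(student_scores, signs))
    print(lse_sign_grad_weights(student_scores, signs))
```

In this sketch the weights always sum to less than one, and a class whose score disagrees strongly with its pseudo-label dominates the gradient, while well-fit classes contribute little.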