Poster 11 Oct 2023

Existing studies in Human-Object Interaction (HOI) classification rely on costly human-annotated labels. This paper studies a new zero-shot setup that removes the dependency on ground-truth labels. We propose a novel Heterogeneous Teacher-Student (HTS) framework and a new loss function. HTS employs a generatively pre-trained image captioner as the teacher and a contrastively pre-trained classifier as the student, combining the discriminability of generative pre-training with the efficiency of contrastive pre-training. To facilitate learning of HOI in this setup, we introduce pseudo-label filtering, which aggregates HOI probabilities from multiple regional captions to supervise the student. To enhance the student's multi-label learning on few-shot classes, we design the LogSumExp (LSE)-Sign loss, which features a dynamic gradient re-weighting mechanism. Ultimately, the student achieves 49.6 mAP on the HICO dataset without using ground truth, setting a new state of the art and outperforming supervised approaches. Code is available.
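
The abstract does not give the formula for the LSE-Sign loss, so the following is only a minimal sketch of one plausible form, assuming the loss is log(1 + sum_i exp(-gamma * y_i * s_i)), where s_i is the student's logit for class i and y_i is +1 or -1 depending on whether the pseudo-label marks the class as present. The function names, the `gamma` scale, and the sign encoding are all hypothetical and are not taken from the paper. The sketch is meant to show how a LogSumExp over signed logits can yield per-class gradients that are re-weighted dynamically: each class receives a softmax-like weight, so classes currently scored on the wrong side dominate the update.

```python
import math

def lse_sign_loss(logits, signs, gamma=1.0):
    """Assumed form (hypothetical): log(1 + sum_i exp(-gamma * y_i * s_i))."""
    # A margin is large and positive when class i is scored on the wrong side.
    margins = [-gamma * y * s for s, y in zip(logits, signs)]
    # Shift by the largest exponent for numerical stability; 0.0 covers the "+1" term.
    m = max([0.0] + margins)
    total = math.exp(-m) + sum(math.exp(z - m) for z in margins)
    return m + math.log(total)

def lse_sign_grad(logits, signs, gamma=1.0):
    """Gradient of the assumed loss with respect to each logit."""
    margins = [-gamma * y * s for s, y in zip(logits, signs)]
    m = max([0.0] + margins)
    exps = [math.exp(z - m) for z in margins]
    denom = math.exp(-m) + sum(exps)
    # Each class gets a softmax-like weight exp(margin_i) / (1 + sum_j exp(margin_j)).
    return [-gamma * y * e / denom for y, e in zip(signs, exps)]

# Toy example: class 0 is a confident positive, class 2 is a wrongly scored negative.
logits = [2.0, -1.0, 0.5]
signs = [+1, -1, -1]
loss = lse_sign_loss(logits, signs)
grads = lse_sign_grad(logits, signs)
```

In this toy example the wrongly scored class (index 2) receives the largest gradient magnitude and the confidently correct class (index 0) the smallest, which is the kind of behavior "dynamic gradient re-weighting" suggests. The released code should be consulted for the actual definition.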