Length: 0:11:00
28 Jun 2022

Gesture recognition, i.e., the classification of videos depicting humans performing hand gestures, is essential for Human-Computer Interaction. To this end, coupled Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) architectures are employed for fast semantic video analysis. However, the typical transfer learning approach of initializing the CNN backbone with weights pretrained for whole-image classification is not necessarily ideal for spatiotemporal video understanding tasks. This paper investigates self-supervised CNN pretraining on a novel pretext task that relies on spatiotemporal video frame corruption: a set of low-level image/video processing building blocks jointly force the CNN to learn to complete missing content. The missing content is likely to coincide with visible moving object boundaries, including human body silhouettes. A CNN parameter set initialized in this way can then improve gesture recognition performance after retraining on the downstream video classification task, without inducing any runtime overhead during the inference stage. Evaluation on a gesture recognition dataset for autonomous Unmanned Aerial Vehicle (UAV) handling demonstrates the effectiveness of the proposed method against both traditional ImageNet initialization and a competing self-supervised pretext task-based initialization.
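The abstract does not specify the corruption pipeline, so the following is only a rough illustration of the general idea of a completion pretext task, not the paper's method. It assumes a simple frame-differencing rule: pixels that change strongly between consecutive frames are blanked out, and a reconstruction loss is computed over the removed pixels only. The function names `corrupt_moving_regions` and `completion_loss`, the `threshold` parameter, and the toy clip are all hypothetical.

```python
import numpy as np


def corrupt_moving_regions(clip, threshold=0.1):
    """Blank out pixels that change strongly between consecutive frames.

    Assumption for illustration only: the paper's actual corruption
    blocks are not described in the abstract.

    clip: float array of shape (T, H, W) with values in [0, 1].
    Returns (corrupted, mask), where mask is True at removed pixels.
    """
    clip = np.asarray(clip, dtype=np.float32)
    # Absolute temporal difference between each frame and its predecessor.
    diff = np.abs(np.diff(clip, axis=0))
    mask = np.zeros(clip.shape, dtype=bool)
    # Frame 0 has no predecessor, so it is left uncorrupted.
    mask[1:] = diff > threshold
    corrupted = np.where(mask, 0.0, clip).astype(np.float32)
    return corrupted, mask


def completion_loss(prediction, target, mask):
    """Mean squared error restricted to the removed pixels."""
    if not mask.any():
        return 0.0
    return float(np.mean((prediction[mask] - target[mask]) ** 2))


# Toy clip: a bright 2x2 square moving one pixel to the right per frame.
clip = np.zeros((3, 6, 6), dtype=np.float32)
for t in range(3):
    clip[t, 2:4, t:t + 2] = 1.0

corrupted, mask = corrupt_moving_regions(clip)
```

In a real pretraining setup, a CNN would take `corrupted` as input and be trained to minimize a loss such as `completion_loss` against the original `clip`. In this toy clip the mask falls on the leading and trailing edges of the moving square, which mirrors the abstract's remark that missing content tends to coincide with moving object boundaries.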