Background-Tolerant Object Classification With Embedded Segmentation Mask for Infrared and Color Imagery
Maliha Arif, Calvin Yong, Abhijit Mahalanobis, Nazanin Rahnavard
Holistic understanding of videos requires recognizing the overall scene, beyond detecting foreground activity and objects. Scene recognition provides valuable information for various video understanding tasks such as video summarization, scene change detection, and content filtering. While significant effort has gone into developing models for scene classification in images (e.g., Places365), video-level scene recognition is relatively nascent. The scope of this paper is to address the problem of going from image-level representations to video-level representations for scene classification. In particular, we compare self-supervised deep learning methods on the video scene recognition task using the HVU dataset. Starting from strong image-level scene representations, we train a video-level scene classifier with a triplet-based contrastive loss, and we propose triplet sampling strategies that aid the self-supervision. We compare the self-supervised techniques against the image-level scene representations as well as a weakly supervised classifier trained on image labels. We observe that the models learned with the self-supervised method outperform both baselines with statistical significance, showing that the video-level scene representations retain the representative power of a competitive image-level scene recognition model trained on Places365 while offering benefits over weakly supervised techniques.
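As a rough illustration of the training recipe outlined in the abstract, the sketch below pools precomputed frame-level scene features into a video-level embedding and updates it with a triplet margin loss. This is a minimal sketch under stated assumptions: the module names, feature dimensions, pooling head, and the way positive/negative clips are sampled are illustrative choices, not the paper's actual architecture or its proposed triplet sampling strategies.

```python
# Hypothetical sketch of triplet-based self-supervision over frame-level scene
# features (e.g., from a model pretrained on Places365). Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoSceneEmbedder(nn.Module):
    """Pools per-frame scene features into a single video-level embedding."""

    def __init__(self, frame_feat_dim=2048, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(frame_feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, embed_dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, frame_feat_dim)
        pooled = frame_feats.mean(dim=1)               # temporal average pooling (an assumption)
        return F.normalize(self.proj(pooled), dim=-1)  # unit-norm video embedding


def triplet_step(model, optimizer, anchor, positive, negative, margin=0.2):
    """One self-supervised update on an (anchor, positive, negative) clip triplet."""
    optimizer.zero_grad()
    za, zp, zn = model(anchor), model(positive), model(negative)
    loss = F.triplet_margin_loss(za, zp, zn, margin=margin)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Dummy tensors standing in for frame features of sampled clips;
    # a real pipeline would draw positives/negatives per a sampling strategy.
    model = VideoSceneEmbedder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    anchor = torch.randn(8, 16, 2048)    # 8 videos, 16 frames each
    positive = torch.randn(8, 16, 2048)  # e.g., another clip from the same video
    negative = torch.randn(8, 16, 2048)  # e.g., a clip from a different video
    print(triplet_step(model, optimizer, anchor, positive, negative))
```

Once trained this way, the learned video-level embeddings could be evaluated for scene classification (for example, with a linear classifier on top), mirroring the comparison against image-level and weakly supervised baselines described above.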