MULTI-SCALE COMPOSITIONAL CONSTRAINTS FOR REPRESENTATION LEARNING ON VIDEOS
Georgios Paraskevopoulos (National Technical University of Athens); Chandrashekhar Lavania (AWS AI Labs); Lovish Chum (Amazon Inc.); Shiva Sundaram (Amazon)
Combining simple concepts to form structured thoughts and decomposing complex concepts into their constituents is a key characteristic of human cognition. In this work, we extract video representations by combining multi-scale processing with compositional constraints, i.e., we constrain the latent space learned by the network so that coarse-grained video features are composed from a set of fine-grained video features using simple functions. We integrate the proposed constraints into a state-of-the-art contrastive learning framework. In our ablations, we evaluate different formulations of the compositional constraints and composition functions. We evaluate the proposed approach on the downstream tasks of action recognition on UCF-101 and video summarization on the SumMe dataset. We achieve significant improvements over the baseline, i.e., relative improvements of 3.9% on UCF-101 and 6.3% on SumMe, showcasing the importance of compositional video representations.
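To make the constraint concrete, the following is a minimal sketch of one possible formulation, assuming mean pooling as the composition function and an L2 penalty between the composed fine-grained features and the coarse-grained feature; the function names and the lambda weighting are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch of a compositional constraint (assumed, not the
    # paper's exact formulation): compose fine-grained clip features with a
    # simple function (here, the mean) and penalize their distance to the
    # coarse-grained feature of the clip that spans them.
    import torch
    import torch.nn.functional as F

    def compositional_loss(fine_feats: torch.Tensor,
                           coarse_feat: torch.Tensor) -> torch.Tensor:
        # fine_feats: (num_clips, dim) features of short, fine-grained clips.
        # coarse_feat: (dim,) feature of the long clip covering those clips.
        composed = fine_feats.mean(dim=0)        # simple composition function
        return F.mse_loss(composed, coarse_feat)  # constrain the latent space

    # The constraint would be added to the contrastive objective, weighted by
    # a hyperparameter (lam is a hypothetical name):
    # total_loss = contrastive_loss + lam * compositional_loss(fine, coarse)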