Multi-scale temporal feature fusion for few-shot action recognition
Jun-Tae Lee, Sungrack Yun
This paper aims to recognize, in testing (query) videos, actions of interest that are specified by only a few support videos. Our approach centers on a novel temporal enrichment module in which features describing local temporal contexts are enhanced by collaboratively merging important information from frame-level (context-free) features. We call this module a multi-scale temporal feature fusion (MSTFF) module. By employing multiple MSTFF modules with varying scopes of local temporal context extraction, we obtain a discriminative video representation, which is crucial in few-shot settings where the support videos alone cannot fully characterize an action class. To stabilize training of the MSTFF-equipped model and further boost performance, we additionally train an auxiliary classifier at the local temporal context level in parallel with the main classifier. We analyze the proposed components to demonstrate their importance and achieve state-of-the-art results on three few-shot action recognition benchmarks: Something-Something V2 (SSv2), HMDB51, and Kinetics.
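To make the fusion idea concrete, the sketch below gives a minimal PyTorch module in the spirit of MSTFF: a 1-D temporal convolution extracts local temporal context at a chosen scope, and an attention weighting selects important frame-level information to merge back in. The class name, layer choices, and exact fusion rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MSTFF(nn.Module):
    """Hypothetical sketch of one multi-scale temporal feature fusion block.

    Assumptions (not from the paper): local context comes from a temporal
    convolution whose kernel size sets the context scope, and frame-level
    features are merged in via a learned attention pooling.
    """
    def __init__(self, dim: int, context_size: int = 3):
        super().__init__()
        # Local temporal context extraction over `context_size` frames.
        self.context_conv = nn.Conv1d(dim, dim, kernel_size=context_size,
                                      padding=context_size // 2)
        # Scores the importance of each frame-level (context-free) feature.
        self.frame_score = nn.Linear(dim, 1)
        # Projects the concatenated context + frame-level cue back to `dim`.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level features, no temporal context.
        context = self.context_conv(frames.transpose(1, 2)).transpose(1, 2)
        # Attention over time picks out important frame-level information.
        attn = torch.softmax(self.frame_score(frames), dim=1)   # (B, T, 1)
        pooled = (attn * frames).sum(dim=1, keepdim=True)       # (B, 1, D)
        # Enrich each local-context feature with the pooled frame-level cue.
        fused = torch.cat([context, pooled.expand_as(context)], dim=-1)
        return self.proj(fused)                                 # (B, T, D)
```

A multi-scale variant would stack several such blocks with different `context_size` values (e.g., 3, 5, 7), mirroring the abstract's use of multiple MSTFF modules with varying scopes of local temporal context extraction.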