Capsule Transformer Network for Dynamic Hand Gesture Recognition using Multimodal Data
Alexandre Lebas, Rim Slama, Hazem Wannous
In recent years, deep learning techniques have achieved remarkable success in video analysis, particularly in action and gesture recognition. Although convolutional neural networks (CNNs) remain the most widely used models, their local feature learning mechanism makes it difficult for them to capture global contextual information across the spatial and temporal domains or across modalities. This paper introduces a Capsule Transformer Network, which is composed of a frame capsule module that extracts hand features and a gesture transformer module that models temporal features and recognizes the dynamic gesture. The capsule module provides spatial attention, enhancing the spatial information of the hand image, while the transformer module provides temporal attention over the gesture sequence. We propose to use multimodal data, including RGB, depth, and IR data, which improves the accuracy of our approach because it better captures the 3D structure of the hand and helps distinguish between similar hand gestures. Tested on two datasets, Briareo and SHREC17, the proposed approach outperforms or equals previous methods.
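The two-stage idea described in the abstract (per-frame capsule features, then attention across frames) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard capsule "squash" nonlinearity and plain scaled dot-product self-attention without learned projections, and the function names `squash` and `temporal_self_attention` are hypothetical. How the RGB, depth, and IR streams are combined is not stated in the abstract, so fusion is left out.

```python
import math


def squash(vec):
    """Standard capsule squash (assumed here, not confirmed by the abstract):
    keeps the vector's direction and maps its length into [0, 1)."""
    norm_sq = sum(x * x for x in vec)
    if norm_sq == 0.0:
        return [0.0 for _ in vec]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Scaled dot-product self-attention over a sequence of frame vectors.

    Simplification: queries, keys, and values are the frame vectors
    themselves (no learned projections, single head).
    """
    dim = len(frames[0])
    attended = []
    for query in frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, frames))
             for i in range(dim)]
        )
    return attended


# Toy usage: three "frames", each already reduced to a 2-D feature vector.
frame_capsules = [squash(f) for f in ([3.0, 4.0], [0.0, 0.0], [1.0, 0.0])]
gesture_features = temporal_self_attention(frame_capsules)
```

In a full model, the attended frame features would be pooled and passed to a classifier; the sketch stops at the attention step because the abstract gives no further detail.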