LETR: A LIGHTWEIGHT AND EFFICIENT TRANSFORMER FOR KEYWORD SPOTTING
Kevin Ding, Martin Zong, Jiakui Li, Baoxiang Li
SPS
Transformers have recently achieved impressive success in a number of domains, including machine translation, image recognition, and speech recognition. Most previous work on keyword spotting (KWS), however, is built upon convolutional or recurrent neural networks. In this paper, we explore a family of Transformer architectures for keyword spotting, optimizing the trade-off between accuracy and efficiency in a high-speed regime. We also study the effectiveness of key components of Vision Transformers when applied to KWS, including patch embedding, position encoding, the attention mechanism, and the class token, and summarize principles for their use. Building on these findings, we propose LeTR: a lightweight and highly efficient Transformer for KWS. We consider different measures of efficiency on different edge devices, so as to best reflect a wide range of application scenarios. Experimental results on two common benchmarks demonstrate that LeTR achieves state-of-the-art results over competing methods with respect to the speed/accuracy trade-off.
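To make the Vision-Transformer components named in the abstract concrete, the sketch below shows how patch embedding, position encoding, self-attention, and a class token fit together in a KWS pipeline. This is a minimal illustrative NumPy sketch, not the authors' LeTR architecture: all sizes (40 mel bins, 98 frames, 40x2 patches, embedding width 64, 12 keywords) and all weight matrices are hypothetical placeholders with random values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: a log-mel spectrogram of 40 mel bins x 98 frames,
# split into non-overlapping 40x2 time patches (49 patches), embedded to d=64.
mel_bins, frames, patch_w, d, n_keywords = 40, 98, 2, 64, 12
n_patches = frames // patch_w

spec = rng.standard_normal((mel_bins, frames))           # stand-in input spectrogram
W_patch = rng.standard_normal((mel_bins * patch_w, d)) * 0.02

# Patch embedding: flatten each 40x2 patch and project it to d dimensions.
patches = (spec.reshape(mel_bins, n_patches, patch_w)
               .transpose(1, 0, 2)
               .reshape(n_patches, -1))                  # (49, 80)
tokens = patches @ W_patch                               # (49, 64)

# Prepend a (normally learnable) class token, then add position encodings.
cls = rng.standard_normal((1, d)) * 0.02
tokens = np.concatenate([cls, tokens], axis=0)           # (50, 64)
pos = rng.standard_normal((n_patches + 1, d)) * 0.02
tokens = tokens + pos

# Single-head self-attention over the token sequence.
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
attn = softmax(q @ k.T / np.sqrt(d))                     # (50, 50) attention weights
out = attn @ v                                           # (50, 64)

# Keyword logits are read from the class-token position only.
W_cls = rng.standard_normal((d, n_keywords)) * 0.02
logits = out[0] @ W_cls                                  # (12,)
```

In a real model the projection, class token, position encodings, and attention weights are trained end-to-end, and several attention blocks with feed-forward layers are stacked; this sketch only shows the data flow the abstract's component list implies.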