Temporal Early Exiting for Streaming Speech Commands Recognition
Raphael Tang, Karun Kumar, Piyush Vyas, Gefei Yang, Yajie Mao, Craig Murray, Ji Xin, Jimmy Lin, Wenyan Li
SPS
Length: 00:07:21
Limited-vocabulary speech commands recognition is the task of classifying a short utterance as one of several speech commands, a task on which neural networks obtain state-of-the-art results. In particular, recurrent neural networks are a common choice for streaming commands recognition systems. In this paper, we explore resource-efficient methods to short-circuit such systems in the time domain when the model is confident in its prediction. We propose applying a frame-level labeling objective to further improve the efficiency-accuracy trade-off. On two limited-vocabulary commands recognition datasets, our best method saves on average 45% of the utterance while reducing absolute accuracy by no more than 0.6 points. We show that the per-instance savings depend on the length of the unique phoneme prefix of each command within a dataset.
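The idea of short-circuiting a streaming recognizer in the time domain can be illustrated with a minimal sketch. It assumes that a recurrent model emits one class-probability vector per audio frame and that "confident" means the largest probability reaches a fixed threshold; the abstract does not specify the exit criterion, so this rule, the function name `early_exit_predict`, and the toy probabilities are all hypothetical and are not the authors' implementation.

```python
from typing import Iterable, Sequence, Tuple


def early_exit_predict(
    frame_probs: Iterable[Sequence[float]],
    num_frames: int,
    threshold: float = 0.9,
) -> Tuple[int, int, float]:
    """Consume per-frame class probabilities and stop once confident.

    Returns (predicted_class, frames_consumed, fraction_of_utterance_saved).
    The max-probability threshold is an assumed confidence rule.
    """
    consumed = 0
    best_class = -1
    for probs in frame_probs:
        consumed += 1
        # Most likely class at the current frame.
        best_class = max(range(len(probs)), key=lambda i: probs[i])
        # Exit early: the remaining frames are never processed.
        if probs[best_class] >= threshold:
            break
    if consumed == 0:
        raise ValueError("frame_probs must contain at least one frame")
    saved = 1.0 - consumed / num_frames
    return best_class, consumed, saved


# Toy stream: 4 frames, 3 command classes (made-up numbers).
toy_stream = [
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.92, 0.05, 0.03],
    [0.97, 0.02, 0.01],
]
label, used, saved = early_exit_predict(
    iter(toy_stream), num_frames=len(toy_stream), threshold=0.9
)
print(label, used, saved)
```

In this toy stream the threshold is first met at the third of four frames, so a quarter of the utterance is skipped. Raising the threshold trades time savings for a lower risk of committing to a wrong command too early, which is the efficiency-accuracy trade-off the abstract refers to.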