Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR
Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig
In this work, we perform comprehensive evaluations of automatic speech recognition (ASR) accuracy and efficiency with three popular training criteria for latency-controlled streaming ASR applications: LF-MMI, CTC and RNN-T. In recognizing challenging social media videos in 7 languages, with training data ranging from 3K to 14K hours, we conduct large-scale controlled experiments for each training criterion with identical datasets and encoder model architectures, and find that RNN-T models have a consistent advantage in word error rate (WER) while CTC models have a consistent advantage in inference efficiency as measured by real-time factor (RTF). Additionally, for each training criterion, we selectively examine various modeling strategies, including modeling units, encoder architectures, and pre-training. To the best of our knowledge, this is the first comprehensive benchmark of these three widely used ASR training criteria on real-world streaming ASR applications across multiple languages.
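As background for the two headline metrics, the standard definitions are sketched below; the paper's exact timing methodology (e.g., which pipeline stages count toward decoding time) is not stated in this abstract, so that detail is an assumption here.

```latex
% Word error rate: edit-distance errors against the reference transcript,
% with S substitutions, D deletions, I insertions, and N reference words.
\mathrm{WER} = \frac{S + D + I}{N}

% Real-time factor: wall-clock decoding time relative to audio duration;
% RTF < 1 means the system decodes faster than real time.
\mathrm{RTF} = \frac{T_{\mathrm{decode}}}{T_{\mathrm{audio}}}
```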