An Attention-Based Joint Acoustic And Text On-Device End-To-End Model

Tara Sainath, Ruoming Pang, Ron Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman

Length: 14:16
04 May 2020

Recently, we introduced a two-pass on-device end-to-end (E2E) speech recognition model, which runs RNN-T in the first pass and then rescores/redecodes the result using a non-causal Listen, Attend and Spell (LAS) decoder. This on-device model obtained performance comparable to a state-of-the-art conventional model. However, like many E2E models, it is trained only on supervised audio-text pairs and thus performs poorly on rare words compared to a conventional model, which incorporates a language model trained on a much larger text corpus. In this work, we introduce a joint acoustic and text decoder (JATD) into the LAS decoder, which makes it possible to incorporate a much larger text corpus into training. We find that the JATD model obtains a 3-10% relative improvement in WER compared to an LAS decoder trained only on supervised audio-text pairs across a variety of proper noun test sets.
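To make the joint training idea concrete, below is a minimal PyTorch sketch (not the authors' code) of one plausible realization: a single attention-based decoder is trained on paired audio-text batches, and on text-only batches it attends to nothing, receiving a learned placeholder context instead, so the same decoder doubles as a language model over a large text corpus. The class name JATDDecoder, the text_only_ctx placeholder, and all dimensions are illustrative assumptions, not the paper's exact mechanism.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JATDDecoder(nn.Module):
        def __init__(self, vocab_size=4096, enc_dim=512, dec_dim=640):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dec_dim)
            self.rnn = nn.LSTMCell(dec_dim + enc_dim, dec_dim)
            self.attn_query = nn.Linear(dec_dim, enc_dim)
            self.out = nn.Linear(dec_dim, vocab_size)
            # Learned stand-in context for text-only batches (an assumption
            # of this sketch, not necessarily the paper's mechanism).
            self.text_only_ctx = nn.Parameter(torch.zeros(enc_dim))

        def forward(self, tokens, enc_out=None):
            # tokens: (B, U) target ids; enc_out: (B, T, enc_dim) or None.
            B, U = tokens.shape
            h = tokens.new_zeros((B, self.out.in_features), dtype=torch.float)
            c = torch.zeros_like(h)
            emb = self.embed(tokens)
            logits = []
            for u in range(U):
                if enc_out is not None:
                    # Dot-product attention over encoder frames (paired data).
                    q = self.attn_query(h).unsqueeze(1)               # (B, 1, E)
                    w = torch.softmax((q * enc_out).sum(-1), dim=-1)  # (B, T)
                    ctx = (w.unsqueeze(-1) * enc_out).sum(1)          # (B, E)
                else:
                    # Text-only data: the decoder acts as a language model.
                    ctx = self.text_only_ctx.expand(B, -1)
                h, c = self.rnn(torch.cat([emb[:, u], ctx], dim=-1), (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)  # (B, U, vocab_size)

    # Interleave paired and text-only steps; the text-only loss is plain
    # next-token cross-entropy, so any large text corpus can be used.
    dec = JATDDecoder()
    opt = torch.optim.Adam(dec.parameters(), lr=1e-4)
    paired = torch.randint(0, 4096, (8, 12))     # tokens with matching audio
    enc_out = torch.randn(8, 50, 512)            # stand-in encoder features
    text_only = torch.randint(0, 4096, (8, 12))  # tokens from a text corpus
    for _ in range(2):
        opt.zero_grad()
        lp = dec(paired[:, :-1], enc_out)
        lt = dec(text_only[:, :-1])              # no acoustics supplied
        loss = (F.cross_entropy(lp.reshape(-1, 4096), paired[:, 1:].reshape(-1))
                + F.cross_entropy(lt.reshape(-1, 4096), text_only[:, 1:].reshape(-1)))
        loss.backward()
        opt.step()

Because the decoder parameters are shared across both kinds of batches, gradients from the text corpus shape the decoder's prior over word sequences, which is one way to explain the reported gains on rare proper nouns.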
