
Tts4pretrain 2.0: Advancing the use of text and speech in ASR pretraining with consistency and contrastive losses

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Gary Wang

11 May 2022

tts4pretrain introduced an effective way to learn representations from untranscribed speech and unspoken text, using linguistic/lexical representations derived from synthesized speech. However, the representations learned from synthesized and real speech are likely to differ, potentially limiting the improvements gained from incorporating unspoken text. In this paper, we introduce learning from supervised speech earlier in the training process with consistency-based regularization between real and synthesized speech, which allows better learning of shared speech and text representations. Specifically, we introduce a new objective with encoder and decoder consistency and contrastive regularization between real and synthesized speech derived from the labeled corpora during the pretraining stage. We show that the new objective leads to more similar representations derived from speech and text, which helps downstream ASR. The proposed pretraining method yields relative Word Error Rate (WER) reductions of 7-21% on six public corpora (Librispeech, AMI, TEDLIUM, CommonVoice, Switchboard, and CHiME-6) over a state-of-the-art baseline pretrained with wav2vec 2.0, and 2-17% over the previously proposed tts4pretrain. The method outperforms supervised SpeechStew by up to 17%. Moreover, we show that the proposed method yields WER reductions regardless of the size of the training corpus, demonstrated on a large-resource, in-house Voice Search task with streaming ASR.
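The abstract does not give the exact form of the regularizers, but the sketch below illustrates the general idea under stated assumptions: a consistency term pulls together paired encoder representations of the same labeled utterance computed from real and synthesized audio, while an InfoNCE-style contrastive term treats the other utterances in the batch as negatives. The function name, the MSE consistency term, the temperature, and the loss weights are illustrative choices, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def consistency_contrastive_loss(real_repr, synth_repr,
                                     temperature=0.1, alpha=1.0, beta=1.0):
        """Hypothetical combined loss between pooled encoder outputs.

        real_repr, synth_repr: (batch, dim) representations of the same
        labeled utterances, one from real audio and one from TTS audio
        synthesized from the transcript.
        """
        # Consistency: pull each real/synthesized pair together (assumed MSE).
        consistency = F.mse_loss(real_repr, synth_repr)

        # Contrastive: the paired synthesized utterance is the positive;
        # all other utterances in the batch serve as negatives.
        real_n = F.normalize(real_repr, dim=-1)
        synth_n = F.normalize(synth_repr, dim=-1)
        logits = real_n @ synth_n.t() / temperature      # (batch, batch)
        targets = torch.arange(real_repr.size(0), device=real_repr.device)
        contrastive = F.cross_entropy(logits, targets)

        # Weighted sum; alpha and beta are assumed tuning knobs.
        return alpha * consistency + beta * contrastive

In the paper this kind of regularization is applied at both the encoder and the decoder during pretraining; the sketch shows only a single, pooled encoder-level variant for clarity.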
