
Tts4pretrain 2.0: Advancing the use of text and speech in ASR pretraining with consistency and contrastive losses

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Gary Wang

11 May 2022

tts4pretrain introduced an effective way to learn representations from untranscribed speech and unspoken text, using linguistic/lexical representations derived from synthesized speech. However, the representations learned from synthesized and real speech are likely to differ, potentially limiting the improvements gained from incorporating unspoken text. In this paper, we introduce learning from supervised speech earlier in the training process with consistency-based regularization between real and synthesized speech, which allows better learning of shared speech and text representations. Specifically, we introduce a new objective with encoder and decoder consistency and contrastive regularization between real and synthesized speech derived from the labeled corpora during the pretraining stage. We show that the new objective leads to more similar representations derived from speech and text, which helps downstream ASR. The proposed pretraining method yields relative Word Error Rate (WER) reductions of 7-21% on six public corpora (Librispeech, AMI, TEDLIUM, CommonVoice, Switchboard, and CHiME-6) over a state-of-the-art baseline pretrained with wav2vec 2.0, and 2-17% over the previously proposed tts4pretrain. The method outperforms supervised SpeechStew by up to 17%. Moreover, we show that the proposed method yields WER reductions regardless of the size of the training corpus, demonstrated on a large-resource, in-house Voice Search task with streaming ASR.
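The abstract does not give the exact form of the regularizers, but the sketch below illustrates the general idea under stated assumptions: a consistency term pulls together paired encoder representations of the same labeled utterance computed from real and synthesized audio, while an InfoNCE-style contrastive term treats the other utterances in the batch as negatives. The function name, the MSE consistency term, the temperature, and the loss weights are illustrative choices, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def consistency_contrastive_loss(real_repr, synth_repr,
                                     temperature=0.1, alpha=1.0, beta=1.0):
        """Hypothetical combined loss between pooled encoder outputs.

        real_repr, synth_repr: (batch, dim) representations of the same
        labeled utterances, one from real audio and one from TTS audio
        synthesized from the transcript.
        """
        # Consistency: pull each real/synthesized pair together (assumed MSE).
        consistency = F.mse_loss(real_repr, synth_repr)

        # Contrastive: the paired synthesized utterance is the positive;
        # all other utterances in the batch serve as negatives.
        real_n = F.normalize(real_repr, dim=-1)
        synth_n = F.normalize(synth_repr, dim=-1)
        logits = real_n @ synth_n.t() / temperature      # (batch, batch)
        targets = torch.arange(real_repr.size(0), device=real_repr.device)
        contrastive = F.cross_entropy(logits, targets)

        # Weighted sum; alpha and beta are assumed tuning knobs.
        return alpha * consistency + beta * contrastive

In the paper this kind of regularization is applied at both the encoder and the decoder during pretraining; the sketch shows only a single, pooled encoder-level variant for clarity.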
