  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:14:47
13 May 2022

Recent end-to-end automatic speech recognition (ASR) models achieve high performance, but they require large amounts of training data. To mitigate a shortage of paired training data, text-to-speech (TTS) systems have been used to generate paired data from text-only data for training the ASR model. The widely used procedure first generates a Mel spectrogram from text, converts it into a waveform with a vocoder, and then converts the waveform back into a Mel spectrogram. The vocoder is used to reduce the mismatch between real and synthesized speech, but it requires a large amount of runtime. In this work, we propose a phone-informed post-processing network that refines Mel spectrograms without using a vocoder. The proposed network takes as input not only Mel spectrograms but also the text information of the speech, enabling phone-informed refinement. Experimental evaluations demonstrate that the proposed network achieves better word error rates (WERs) than the vocoder network in an English domain adaptation task (LibriSpeech to TED-LIUM 2; read speech to spontaneous speech) with much less data generation time, and that the use of phone information is critical to the improvement. We also confirm the effectiveness of the proposed model in a Japanese domain adaptation task (CSJ-SPS to CSJ-APS; everyday topics to academic topics).
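
The two data-generation paths contrasted in the abstract can be sketched as a data-flow toy in Python. This is only an illustration under stated assumptions, not the authors' implementation: every function name, the 80-bin Mel dimension, the hop size, the fixed number of frames per phone, and the random untrained weights are hypothetical placeholders chosen to show which inputs each path consumes.

```python
import numpy as np

N_MELS = 80            # assumed Mel dimension (not stated in the abstract)
FRAMES_PER_PHONE = 5   # assumed fixed phone-to-frame alignment, for illustration only

def tts_text_to_mel(phone_ids, seed=0):
    """Hypothetical stand-in for a TTS acoustic model: phone IDs -> Mel frames."""
    rng = np.random.default_rng(seed)
    n_frames = len(phone_ids) * FRAMES_PER_PHONE
    return rng.standard_normal((n_frames, N_MELS))

def vocoder_round_trip(mel, hop=256):
    """Placeholder for the conventional path: Mel -> waveform -> Mel.
    A real vocoder must synthesize `hop` samples per frame, which is the slow step."""
    wave = np.repeat(mel.mean(axis=1), hop)      # dummy waveform, hop samples per frame
    frames = wave.reshape(-1, hop)               # re-frame the waveform
    # dummy re-analysis back to a (frames, N_MELS) spectrogram
    return np.tile(frames.mean(axis=1, keepdims=True), (1, N_MELS))

def phone_informed_postnet(mel, phone_ids, n_phones=50, emb_dim=16, seed=1):
    """Placeholder post-processing network: refines Mel frames using phone identity.
    Weights are random and untrained; only the inputs and shapes are meaningful."""
    rng = np.random.default_rng(seed)
    phone_emb = rng.standard_normal((n_phones, emb_dim))
    weight = rng.standard_normal((N_MELS + emb_dim, N_MELS)) * 0.01
    frame_phones = np.repeat(phone_ids, FRAMES_PER_PHONE)   # one phone ID per frame
    x = np.concatenate([mel, phone_emb[frame_phones]], axis=1)
    return mel + x @ weight                      # residual refinement, no waveform step

phone_ids = [3, 7, 1]                            # toy phone sequence
mel = tts_text_to_mel(phone_ids)
mel_vocoder = vocoder_round_trip(mel)            # conventional path
mel_postnet = phone_informed_postnet(mel, phone_ids)  # vocoder-free path
print(mel.shape, mel_vocoder.shape, mel_postnet.shape)
```

Both paths return a spectrogram with the same shape as the TTS output, so either could feed ASR training. The difference the sketch highlights is that the post-processing path never materializes a waveform and receives the phone sequence as an extra input.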
