
PROSODY-AWARE SPEECHT5 FOR EXPRESSIVE NEURAL TTS

Yan Deng (Microsoft); Long Zhou (Microsoft Research Asia); Yuanhao Yi (Microsoft); Shujie Liu (Microsoft Research Asia); Lei He (Microsoft Cloud and AI)

06 Jun 2023

SpeechT5, a multimodal learning framework that explores encoder-decoder pre-training by leveraging both unlabeled speech and text, has proven effective on a wide variety of speech processing tasks. In this paper, we enhance SpeechT5 for neural text-to-speech (TTS) by adding a new sub-task on prosody modeling (prosody-aware SpeechT5), which improves the model's ability to learn richer contextual representations through multi-task learning. In the prosody-aware SpeechT5 training framework, most modules of the neural TTS system, including the encoder, decoder, and variance adaptor, can be pre-trained on large-scale unlabeled speech and text corpora. Experimental results show that the proposed prosody-aware SpeechT5 is effective at improving the expressiveness of neural TTS: 1) the CMOS (comparison mean opinion score) gain is 0.154 for texts from the news domain and 0.114 for texts from the audiobook domain; and 2) prosody-related issues in synthetic speech are reduced by 19.02% in subjective evaluation.
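The abstract names the main TTS modules (encoder, decoder, variance adaptor) and a multi-task objective that adds prosody prediction to the usual reconstruction loss, but gives no implementation detail. The PyTorch sketch below only illustrates that idea under assumed shapes, predictor architecture, and loss weights; the pitch/energy/duration heads and their weighting are hypothetical and not the authors' implementation.

    # Hypothetical sketch: FastSpeech-style variance predictors whose prosody
    # losses are added to the TTS reconstruction loss (multi-task training).
    # All module names, sizes, and weights are illustrative assumptions.
    import torch
    import torch.nn as nn

    class VariancePredictor(nn.Module):
        """Tiny per-position regressor for one prosody attribute (pitch, energy, or duration)."""

        def __init__(self, hidden: int = 256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            self.proj = nn.Linear(hidden, 1)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, time, hidden) encoder states -> (batch, time) predicted values
            x = self.conv(h.transpose(1, 2)).transpose(1, 2)
            return self.proj(x).squeeze(-1)

    class ProsodyAwareLoss(nn.Module):
        """Combine the spectrogram reconstruction loss with auxiliary prosody-prediction losses."""

        def __init__(self, w_pitch: float = 0.1, w_energy: float = 0.1, w_dur: float = 0.1):
            super().__init__()
            self.weights = (w_pitch, w_energy, w_dur)  # assumed weights, not from the paper
            self.l1 = nn.L1Loss()
            self.mse = nn.MSELoss()

        def forward(self, mel_pred, mel_tgt, prosody_pred, prosody_tgt):
            # prosody_pred / prosody_tgt: tuples of (pitch, energy, log-duration) tensors
            loss = self.l1(mel_pred, mel_tgt)
            for w, pred, tgt in zip(self.weights, prosody_pred, prosody_tgt):
                loss = loss + w * self.mse(pred, tgt)
            return loss

    if __name__ == "__main__":
        B, T, H, M = 2, 50, 256, 80
        enc = torch.randn(B, T, H)                       # stand-in for encoder states
        pitch_head, energy_head, dur_head = (VariancePredictor(H) for _ in range(3))
        preds = (pitch_head(enc), energy_head(enc), dur_head(enc))
        targets = tuple(torch.randn(B, T) for _ in range(3))
        loss = ProsodyAwareLoss()(torch.randn(B, T, M), torch.randn(B, T, M), preds, targets)
        print(float(loss))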
