07 Jun 2023

This work proposes an improved text-predicted style embedding for expressive speech synthesis. A text-predicting global style token (TPGST) model predicts a style embedding from text instead of from reference speech and uses it to condition a text-to-speech (TTS) model, enabling stylized TTS without reference speech. Although training minimizes the L1 loss between the style embedding extracted from reference speech and the embedding predicted from text, the predicted embedding tends to be over-smoothed. To overcome this issue, the proposed method trains the style predictor with a generative adversarial network (GAN). This not only improves style reproduction but also aims to reduce the style-conditioning mismatch during TTS model training. We also use TTS text embeddings, as in related work, together with word-level information from BERT to find better style distributions in the GAN. A subjective evaluation of style reproduction demonstrates that 1) the proposed method outperforms conventional TPGST, and 2) using word information from BERT provides even better performance. Our style predictor is also effective at achieving unseen-style TTS for both seen and unseen speakers.
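To illustrate the idea of adversarially trained style prediction, the following is a minimal PyTorch sketch, not the authors' actual architecture. It assumes a hypothetical StylePredictor that maps TTS text embeddings and BERT word features to a style embedding, and a StyleDiscriminator that distinguishes reference-derived embeddings from predicted ones; all module names, dimensions, and the loss weighting are illustrative assumptions.

```python
# Hypothetical sketch: adversarial training of a text-based style predictor.
# Module names, dimensions, and loss weights are assumptions for illustration,
# not the paper's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

STYLE_DIM, TEXT_DIM, BERT_DIM = 256, 512, 768  # assumed feature sizes

class StylePredictor(nn.Module):
    """Predicts a style embedding from TTS text embeddings and BERT word features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + BERT_DIM, 512), nn.ReLU(),
            nn.Linear(512, STYLE_DIM),
        )

    def forward(self, text_emb, bert_emb):
        # Average over the time axis to obtain utterance-level features.
        feat = torch.cat([text_emb.mean(dim=1), bert_emb.mean(dim=1)], dim=-1)
        return self.net(feat)

class StyleDiscriminator(nn.Module):
    """Distinguishes reference-derived style embeddings from predicted ones."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STYLE_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, style_emb):
        return self.net(style_emb)

predictor, discriminator = StylePredictor(), StyleDiscriminator()
opt_g = torch.optim.Adam(predictor.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(text_emb, bert_emb, ref_style, adv_weight=0.1):
    """One optimization step: L1 regression loss plus an adversarial term."""
    # --- Discriminator update: real = reference-derived style, fake = predicted style.
    pred_style = predictor(text_emb, bert_emb).detach()
    d_real = discriminator(ref_style)
    d_fake = discriminator(pred_style)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Predictor update: match the reference embedding (L1) while also
    #     fooling the discriminator, which discourages over-smoothed predictions.
    pred_style = predictor(text_emb, bert_emb)
    d_fake = discriminator(pred_style)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_g = F.l1_loss(pred_style, ref_style) + adv_weight * loss_adv
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```

The key point of the sketch is the generator objective: the L1 term alone drives predictions toward the mean of plausible styles, while the added adversarial term pushes predicted embeddings toward the distribution of embeddings extracted from real reference speech.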
