Transformer-Based Text-To-Speech With Weighted Forced Attention

Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai

04 May 2020

This paper investigates state-of-the-art Transformer- and FastSpeech-based high-fidelity neural text-to-speech (TTS) with full-context label input for pitch accent languages. The aim is to realize faster training than with conventional Tacotron-based models. Introducing phoneme durations into Tacotron-based TTS models improves both synthesis quality and stability. Therefore, a Transformer-based acoustic model with weighted forced attention obtained from phoneme durations is proposed to improve synthesis accuracy and stability, where both encoder–decoder attention and forced attention are used with a weighting factor. Furthermore, FastSpeech without a duration predictor, in which the phoneme durations are instead predicted by another conventional model, is also investigated. The results of experiments using a Japanese female corpus and the WaveGlow vocoder indicate that the proposed Transformer using forced attention with a weighting factor of 0.5 outperforms the other models, and that removing the duration predictor from FastSpeech improves synthesis quality, although the proposed weighted forced attention does not improve synthesis stability.
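The abstract does not spell out the exact formulation, but the core idea of weighted forced attention can be sketched as a convex combination of the learned encoder–decoder attention matrix and a hard alignment matrix built from externally supplied phoneme durations. The following is a minimal illustrative sketch under that assumption; the function names, the NumPy formulation, and the exact blending form are assumptions, not the authors' implementation.

```python
import numpy as np


def forced_alignment(durations, n_frames):
    """Build a hard (0/1) alignment matrix from phoneme durations.

    durations: per-phoneme frame counts from an external duration model
               or forced aligner (hypothetical input format).
    Returns an (n_frames, n_phonemes) matrix in which each decoder frame
    attends entirely to the phoneme whose duration span contains it.
    """
    n_phonemes = len(durations)
    align = np.zeros((n_frames, n_phonemes))
    frame = 0
    for p, dur in enumerate(durations):
        align[frame:frame + dur, p] = 1.0
        frame += dur
    return align


def weighted_forced_attention(soft_attn, durations, w=0.5):
    """Blend learned attention with forced attention via weighting factor w.

    soft_attn: (n_frames, n_phonemes) encoder-decoder attention weights
               produced by the Transformer acoustic model.
    w:         weighting factor on the forced alignment; the paper reports
               that w = 0.5 performed best in its experiments.
    """
    hard_attn = forced_alignment(durations, n_frames=soft_attn.shape[0])
    # Assumed convex combination: w = 0 recovers purely learned attention,
    # w = 1 uses the forced alignment alone.
    return (1.0 - w) * soft_attn + w * hard_attn


# Toy usage: three phonemes spanning 2 + 3 + 1 = 6 decoder frames.
durations = [2, 3, 1]
soft_attn = np.full((6, 3), 1.0 / 3.0)  # stand-in for model attention
blended = weighted_forced_attention(soft_attn, durations, w=0.5)
print(blended)
```

With w = 0.5, each decoder frame's attention is pulled halfway toward the duration-derived alignment while still retaining the model's learned distribution, which matches the setting the paper found to outperform the other models.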
