
VF-TACO2: TOWARDS FAST AND LIGHTWEIGHT SYNTHESIS FOR AUTOREGRESSIVE MODELS WITH VARIATION AUTOENCODER AND FEATURE DISTILLATION

Yuhao Liu (Tianjin University); Cheng Gong (Tianjin University); Longbiao Wang (Tianjin University); Xixin Wu (The Chinese University of Hong Kong); Qiuyu Liu (Tianjin University); Jianwu Dang (Tianjin University)

07 Jun 2023

With the development of deep learning, end-to-end neural text-to-speech (TTS) systems have achieved significant improvements in high-quality speech synthesis. However, most of these systems are attention-based autoregressive models, which suffer from slow synthesis speed and large parameter counts. In this paper, we propose a new fast and lightweight TTS framework named VF-Taco2, which can quickly synthesize speech without GPUs. We first profile the complexity of the decoder in the current autoregressive model and design a novel multiple-frames prediction module based on a variational autoencoder (VAE) to alleviate the quality degradation that occurs when a larger "reduction factor" is applied. In addition, feature distillation is leveraged to compress the relatively large proposed model into a smaller version with only a minor loss of speech quality. Compared to the original Tacotron 2, our VF-Taco2 achieves a 3.6x-4.4x mel-spectrogram generation speedup on CPUs of different performance levels, and the parameters are compressed by 1.5x while speech quality is maintained.
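The abstract only outlines the two ideas. As an illustration, the sketch below shows, in PyTorch, what a multiple-frame decoder step conditioned on a VAE latent might look like: one decoder step emits r mel frames at once (the "reduction factor"), and a reparameterized latent z conditions the frame projection to offset the quality loss a large r would otherwise cause. All class, layer, and parameter names here are hypothetical and not taken from the paper; this is a minimal sketch assuming a Tacotron-2-style decoder, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiFrameVAEDecoderStep(nn.Module):
    """Hypothetical sketch: one decoder step that emits `r` mel frames at once,
    conditioned on a VAE latent drawn from the current decoder state."""

    def __init__(self, hidden_dim=1024, mel_dim=80, latent_dim=16, r=3):
        super().__init__()
        self.r = r
        self.mel_dim = mel_dim
        # posterior q(z | decoder state): mean and log-variance heads
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # project state + latent to r frames of mel features in one step
        self.frame_proj = nn.Linear(hidden_dim + latent_dim, mel_dim * r)

    def forward(self, decoder_state):
        # decoder_state: (batch, hidden_dim), e.g. the attention-RNN output
        mu = self.to_mu(decoder_state)
        logvar = self.to_logvar(decoder_state)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        frames = self.frame_proj(torch.cat([decoder_state, z], dim=-1))
        frames = frames.view(-1, self.r, self.mel_dim)  # (batch, r, mel_dim)
        # KL term against a standard-normal prior, added to the training loss
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return frames, kl
```

Likewise, a generic feature-distillation objective (again only an assumed form, not the paper's exact loss) matches intermediate student features to the teacher's, layer by layer, so that a compressed student mimics the larger teacher:

```python
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats, proj_layers):
    """Hypothetical sketch: project each student feature to the teacher's width
    and penalize the L2 distance; the teacher is frozen via detach()."""
    loss = 0.0
    for s, t, proj in zip(student_feats, teacher_feats, proj_layers):
        loss = loss + F.mse_loss(proj(s), t.detach())
    return loss / len(proj_layers)
```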
