MULTI-SPEAKER EXPRESSIVE SPEECH SYNTHESIS VIA MULTIPLE FACTORS DECOUPLING

Xinfa Zhu, Yi Lei, Kun Song, Yongmao Zhang, Tao Li, Lei Xie (Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China)

06 Jun 2023

This paper aims to synthesize a target speaker's speech with a desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt a multi-label binary vector (MBV) to discretize the extracted embeddings and mutual information (MI) minimization to disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of our model design.
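
Two of the ideas in the abstract lend themselves to a compact illustration: discretizing an embedding into a multi-label binary vector, and attention-based selection over a reference-candidate pool. The PyTorch sketch below is not the authors' implementation; the module and function names (MBVQuantizer, select_reference) are hypothetical, and the straight-through estimator and scaled dot-product attention are standard stand-ins for the mechanisms the abstract names.

```python
# Minimal sketch (not the paper's code) of MBV discretization and
# attention-based reference selection. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MBVQuantizer(nn.Module):
    """Binarize a continuous embedding into a multi-label binary vector.

    The forward value is a hard 0/1 vector; the straight-through trick
    routes gradients through the underlying sigmoid so the encoder
    producing `emb` remains trainable.
    """

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(emb)            # (batch, dim), values in (0, 1)
        hard = (soft > 0.5).float()          # discrete multi-label vector
        return hard + soft - soft.detach()   # straight-through estimator


def select_reference(query: torch.Tensor, pool: torch.Tensor) -> torch.Tensor:
    """Soft attention-based selection from a reference-candidate pool.

    query: (batch, dim)  -- e.g., an utterance-level embedding of the input
    pool:  (n_refs, dim) -- embeddings of candidate reference utterances
    Returns a (batch, dim) attention-weighted combination of pool entries.
    """
    scores = query @ pool.t() / pool.size(-1) ** 0.5  # scaled dot product
    weights = F.softmax(scores, dim=-1)               # (batch, n_refs)
    return weights @ pool
```

Under these assumptions, a non-parallel transfer step would embed the candidate references once into `pool` and call `select_reference` per utterance, so the model attends to the most compatible reference expression rather than relying on a single fixed reference.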
