VQ-CL: Learning disentangled speech representations with contrastive learning and vector quantization

Huaizhen Tang (University of Science and Technology of China); Xulong Zhang (Ping An Technology (Shenzhen) Co., Ltd.); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd.); Ning Cheng (Ping An Technology (Shenzhen) Co., Ltd.); Jing Xiao (Ping An Insurance (Group) Company of China)

07 Jun 2023

Voice conversion (VC) changes the voice characteristics of an utterance so that it sounds as if it were spoken by another person. Recent studies have increasingly focused on disentanglement-based VC, which separates timbre from linguistic content in a speech signal in order to perform the conversion. However, it is still challenging to extract phoneme-level features from frame-level hidden representations. This paper proposes a novel zero-shot voice conversion framework, called VQ-CL, that uses contrastive learning and vector quantization to bring the frame-level hidden features closer to phoneme-level linguistic information. Both objective and subjective experimental results show that VQ-CL separates content and voice characteristics better than previous studies, which improves the sound quality of the generated speech.
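The abstract names two generic building blocks, vector quantization and contrastive learning, without giving the architecture or loss functions. The sketch below is not the authors' implementation. It only illustrates the textbook versions of both ideas on toy 2-D vectors: nearest-codebook lookup, and an InfoNCE-style loss that is low when a frame is paired with its matching code. The function names `quantize` and `info_nce`, the temperature value, and all data are hypothetical choices made for illustration.

```python
import math

def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector
    (squared Euclidean distance). Generic VQ lookup, not the paper's model."""
    indices = []
    for frame in frames:
        distances = [sum((f - c) ** 2 for f, c in zip(frame, code))
                     for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor: negative log-softmax of the
    positive pair's similarity among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]

# Toy data (assumed): three codes standing in for phoneme-like units,
# four frame-level feature vectors.
codebook = [[1.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.9, -0.1], [0.9, 1.2], [1.1, 0.8], [-0.8, 1.1]]

indices = quantize(frames, codebook)

# Pair frame 1 with its own code (match) and with a different code (mismatch).
loss_match = info_nce(frames[1], codebook[1], [codebook[0], codebook[2]])
loss_mismatch = info_nce(frames[1], codebook[2], [codebook[0], codebook[1]])
print(indices, loss_match, loss_mismatch)
```

In this toy setup, adjacent frames 1 and 2 map to the same code, which is the sense in which quantization can group frame-level features into coarser units. The contrastive loss is smaller for the matching pair than for the mismatched one, so minimizing it would pull a frame toward its assigned code and away from the others.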

Tags:

Speech and singing voice synthesis/conversion/coding

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

