
Preserving background sound in noise-robust voice conversion via multi-task learning

Jixun Yao (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Qing Wang (Northwestern Polytechnical University); Pengcheng Guo (Northwestern Polytechnical University); Ziqian Ning (Northwestern Polytechnical University); Lei Xie (Northwestern Polytechnical University); Hai Li (iQIYI Inc.); Junhui Liu (iQIYI Inc.); Danming Xie (iQIYI Inc.)

08 Jun 2023

Background sound is an informative artistic element that helps provide a more immersive experience in real-application voice conversion (VC) scenarios. However, prior VC research has focused mainly on clean voices and paid little attention to VC with background sound. The critical problems in preserving background sound during VC are the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework trained via multi-task learning that sequentially cascades a source separation (SS) module, a bottleneck feature extraction module, and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task, and the unified task share a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving quality and speaker similarity comparable to VC models trained on clean data.
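The abstract describes a three-stage cascade (SS module, bottleneck feature extraction, VC module) trained jointly so that the separation, conversion, and unified tasks share a reconstruction loss. Below is a minimal sketch of how such a cascade and multi-task objective might be wired up, assuming PyTorch. All module architectures, feature dimensions, loss weights, and names (SSModule, BNExtractor, VCModule, multitask_loss, lambda_*) are illustrative assumptions rather than the paper's actual implementation, and the sketch operates on magnitude-like features, not the phase-aware representation the paper emphasizes.

```python
# Hypothetical sketch of the cascaded multi-task setup; shapes, modules,
# and loss weights are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSModule(nn.Module):
    """Source separation: splits a noisy mixture into speech and background."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.GRU(dim, dim * 2, batch_first=True)
        self.head = nn.Linear(dim * 2, dim * 2)  # masks for speech + background

    def forward(self, mix):                       # mix: (B, T, dim)
        h, _ = self.net(mix)
        speech_mask, bg_mask = self.head(h).sigmoid().chunk(2, dim=-1)
        return mix * speech_mask, mix * bg_mask   # separated speech, background

class BNExtractor(nn.Module):
    """Bottleneck features extracted from the separated speech."""
    def __init__(self, dim=80, bn=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, bn), nn.ReLU(), nn.Linear(bn, bn))

    def forward(self, speech):
        return self.proj(speech)

class VCModule(nn.Module):
    """Converts bottleneck features plus a target speaker embedding to speech features."""
    def __init__(self, bn=256, spk=128, dim=80):
        super().__init__()
        self.dec = nn.GRU(bn + spk, 256, batch_first=True)
        self.out = nn.Linear(256, dim)

    def forward(self, bn_feat, spk_emb):          # spk_emb: (B, spk)
        spk = spk_emb.unsqueeze(1).expand(-1, bn_feat.size(1), -1)
        h, _ = self.dec(torch.cat([bn_feat, spk], dim=-1))
        return self.out(h)

def multitask_loss(mix, clean_speech, background, target_speech, spk_emb,
                   ss, bn, vc, lambda_ss=1.0, lambda_vc=1.0, lambda_uni=1.0):
    sep_speech, sep_bg = ss(mix)
    # SS task: reconstruct clean speech and background from the mixture.
    loss_ss = F.l1_loss(sep_speech, clean_speech) + F.l1_loss(sep_bg, background)
    # VC task: convert the separated speech toward the target speaker.
    converted = vc(bn(sep_speech), spk_emb)
    loss_vc = F.l1_loss(converted, target_speech)
    # Unified task (one plausible reading of the abstract): the converted
    # speech remixed with the separated background should reconstruct the
    # target speech over the original background.
    loss_uni = F.l1_loss(converted + sep_bg, target_speech + background)
    return lambda_ss * loss_ss + lambda_vc * loss_vc + lambda_uni * loss_uni

# Usage example: one backward pass on random tensors (shapes illustrative).
B, T, dim = 4, 100, 80
ss, bn, vc = SSModule(dim), BNExtractor(dim), VCModule()
mix, clean, bg, target = (torch.randn(B, T, dim) for _ in range(4))
loss = multitask_loss(mix, clean, bg, target, torch.randn(B, 128), ss, bn, vc)
loss.backward()
```

Joint training under this shared reconstruction loss is what the abstract credits with reducing the cascade mismatch: gradients from the VC and unified terms flow back through the SS module, so separation errors that hurt conversion are penalized directly rather than being fixed in isolation.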
