08 Jun 2021

In this paper, we present a simple but novel framework to train a non-parallel many-to-many voice conversion (VC) model based on the encoder-decoder architecture. It is observed that an encoder-decoder text-to-speech (TTS) model and an encoder-decoder VC model have the same structure. Thus, we propose to pre-train a multi-speaker encoder-decoder TTS model and transfer knowledge from the TTS model to a VC model by (1) adopting the TTS acoustic decoder as the VC acoustic decoder, and (2) forcing the VC speech encoder to learn the same speaker-agnostic linguistic features from the TTS text encoder so as to achieve speaker disentanglement in the VC encoder output. We further control the conversion of the pitch contour from source speech to target speech, and condition the VC decoder on the converted pitch contour during inference. Subjective evaluation shows that our proposed model is able to handle VC between any speaker pairs in the training speech corpus of over 200 speakers with high naturalness and speaker similarity.
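The abstract mentions converting the pitch contour from the source speaker to the target speaker and conditioning the VC decoder on it, but does not spell out the conversion itself. A common choice for this step, sketched here purely as an assumption, is log-Gaussian mean-variance normalization of F0: map the source log-F0 statistics onto the target speaker's, leaving unvoiced frames untouched.

```python
import math
from statistics import mean, stdev

def logf0_stats(f0):
    """Mean and std of log-F0 over voiced frames (f0 > 0)."""
    voiced = [math.log(v) for v in f0 if v > 0]
    return mean(voiced), stdev(voiced)

def convert_pitch(f0_src, src_stats, tgt_stats):
    """Map a source pitch contour into the target speaker's log-F0
    distribution (mean-variance normalization in the log domain).
    Unvoiced frames, marked by f0 = 0, are passed through unchanged.
    This is a standard heuristic, not necessarily the paper's method."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = []
    for v in f0_src:
        if v <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
        else:
            z = (math.log(v) - mu_s) / sd_s  # z-score in source log-F0 space
            out.append(math.exp(z * sd_t + mu_t))  # re-scale to target space
    return out
```

Because the transform is linear in the log domain, the converted voiced frames match the target speaker's log-F0 mean and variance exactly, which is what the decoder would then be conditioned on at inference time.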

Chairs:
Tomoki Toda
