Hierarchical Diffusion Models for Singing Voice Neural Vocoder

Naoya Takahashi (Sony Group); Mayank Kumar Singh (Sony Research India); Yuki Mitsufuji (Sony Group Corporation)

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

06 Jun 2023

Recent progress in deep generative models has improved the quality of neural vocoders in speech domain. However, generating a high-quality singing voice remains challenging due to a wider variety of musical expressions in pitch, loudness, and pronunciations. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating in different sampling rates; the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and other models progressively generate the waveform at higher sampling rates on the basis of the data at the lower sampling rate and acoustic features. Experimental results show that the proposed method produces high-quality singing voices for multiple singers, outperforming state-of-the-art neural vocoders with a similar range of computational costs.

Tags:

Speech and singing voice synthesis/convertion/coding

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

Naoya Takahashi (Sony Group); Mayank Kumar Singh (Sony Research India); Yuki Mitsufuji (Sony Group Corporation)

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

DiffVoice: Text-to-Speech with Latent Diffusion

PHONEix: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation with Phoneme Distribution Predictor

Join an IEEE Society