Singing Voice Synthesis Based on a Musical Note Position-aware Attention Mechanism

Yukiya Hono (Nagoya Institute of Technology); Kei Hashimoto (Nagoya Institute of Technology); Yoshihiko Nankaku (Nagoya Institute of Technology); Keiichi Tokuda (Department of Computer Science and Engineering, Nagoya Institute of Technology)

08 Jun 2023

This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, due to the difficulty of the temporal modeling of singing voices, many recent SVS systems with an encoder-decoder-based model still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling using seq2seq models with an attention mechanism, they are not sufficiently robust in temporal modeling. The proposed attention mechanism is designed to estimate the attention weights by considering the rhythm given by the musical score. Furthermore, several techniques are introduced to improve the modeling performance of the singing voice. Experimental results indicated that the proposed model is effective in terms of both naturalness and robustness of timing.
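The abstract does not give the formulation of the attention mechanism, so the following is only an illustrative sketch of the general idea of letting score rhythm influence attention weights. It assumes a hypothetical scheme in which each candidate note receives a penalty proportional to how far the current output time lies outside that note's span in the score, and the penalty is added to a content-based score before the softmax. The function names, the `sharpness` parameter, and the penalty itself are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def note_position_bias(frame_time, note_spans, sharpness=1.0):
    """Hypothetical bias: 0 inside a note's span, more negative farther away.

    note_spans is a list of (start, end) times taken from the musical score.
    """
    biases = []
    for start, end in note_spans:
        if frame_time < start:
            distance = start - frame_time
        elif frame_time > end:
            distance = frame_time - end
        else:
            distance = 0.0
        biases.append(-sharpness * distance)
    return biases


def position_aware_attention(content_scores, frame_time, note_spans, sharpness=1.0):
    """Add the score-derived bias to content scores, then normalize."""
    biases = note_position_bias(frame_time, note_spans, sharpness)
    combined = [s + b for s, b in zip(content_scores, biases)]
    return softmax(combined)


# Three notes from a score, in seconds (illustrative values only).
note_spans = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]
# Uninformative content scores, so only the rhythm bias shapes the weights.
content_scores = [0.0, 0.0, 0.0]
weights = position_aware_attention(content_scores, 1.5, note_spans, sharpness=5.0)
```

With uninformative content scores, the weights at an output time of 1.5 s concentrate on the third note, whose span contains that time. In a trained model the content scores would come from the encoder and decoder states, and the bias would only nudge the alignment toward the rhythm in the score.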

Tags:

Speech and singing voice synthesis/conversion/coding

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
