Low-latency electrolaryngeal speech enhancement based on FastSpeech2-based voice conversion and self-supervised speech representation

Kazuhiro Kobayashi (Nagoya University); Tomoki Hayashi (Nagoya University); Tomoki Toda (Nagoya University)


07 Jun 2023

In this paper, we propose a low-latency sequence-to-sequence speech enhancement technique for electrolaryngeal (EL) speech. A low-latency EL speech enhancement technique based on CLDNN was previously proposed to enable laryngectomees to produce more natural-sounding speech than the original EL speech. However, the naturalness of the enhanced speech is still far from that of natural speech, because frame-wise conversion processing models the acoustic feature sequence and the speech waveform with insufficient accuracy. To solve this problem, we propose a low-latency sequence-to-sequence EL speech enhancement technique based on CLDNN-FastSpeech2-based voice conversion (VC). Moreover, to improve factors such as naturalness, intelligibility, and robustness, we also propose two techniques: 1) utilizing a self-supervised speech representation (SSL) and 2) randomly sampling the predicted and ground-truth features to alleviate prediction errors in the variance adaptor. The experimental results demonstrate that the proposed method outperforms the conventional frame-based EL speech enhancement technique in both objective and subjective evaluations.
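The second technique in the abstract, randomly sampling between predicted and ground-truth features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_variance_features`, the per-frame sampling granularity, and the probability parameter `p_predicted` are all assumptions made for illustration, since the abstract does not specify how the sampling is done. The general idea is that a model trained only on ground-truth variance features (for example pitch or energy) never sees its own prediction errors, so exposing it to a random mix during training may make it more robust at inference time.

```python
import random


def mix_variance_features(predicted, ground_truth, p_predicted=0.5, rng=None):
    """Hypothetical helper: build a training input by choosing, frame by frame,
    the predicted value with probability p_predicted and the ground-truth
    value otherwise. Per-frame sampling is an assumption of this sketch."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    rng = rng or random.Random()
    return [
        pred if rng.random() < p_predicted else truth
        for pred, truth in zip(predicted, ground_truth)
    ]


# Toy per-frame pitch values (made-up numbers, for illustration only).
predicted_f0 = [101.0, 99.5, 120.3, 118.0]
ground_truth_f0 = [100.0, 100.0, 121.0, 119.0]
mixed_f0 = mix_variance_features(
    predicted_f0, ground_truth_f0, p_predicted=0.5, rng=random.Random(0)
)
```

Setting `p_predicted=0.0` reduces this to ordinary teacher forcing on ground-truth features, while `p_predicted=1.0` feeds only predictions; intermediate values trade off between the two.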

Tags:

Speech and singing voice synthesis/conversion/coding


Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
