Self-adaptive Incremental Machine Speech Chain for Lombard TTS with High-granularity ASR Feedback in Dynamic Noise Condition

Sashi Novitasari (Nara Institute of Science and Technology); Sakriani Sakti (Japan Advanced Institute of Science and Technology); Satoshi Nakamura (Nara Institute of Science and Technology, Japan)

06 Jun 2023

A common approach for text-to-speech (TTS) in noisy conditions is offline fine-tuning, which is generally applied to static noise and predefined conditions. We recently proposed a self-adaptive TTS in machine speech chain inference that enables TTS to control its voice in statically and dynamically noisy environments based on auditory feedback from automatic speech recognition (ASR) and speech-to-noise ratio (SNR) recognition. However, that study evaluated the system only on synthetic Lombard speech data. Furthermore, the ASR feedback had low granularity, as it was based only on the loss of the positive character class. In this paper, we improve the self-adaptive TTS by using higher-granularity ASR feedback at the character-vocabulary level, which considers the losses of both the positive and negative classes. We focus on a self-adaptive incremental TTS (Adapt-ITTS) with a short-term feedback mechanism that aims for low-latency adaptation in dynamically noisy situations. In experiments, our proposed Adapt-ITTS improved intelligibility in noisy conditions on synthetic and natural Lombard speech data from the Wall Street Journal and Hurricane datasets, respectively.
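
The difference between the two feedback granularities mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formula, so the per-class binary cross-entropy form, the function names, and the toy posterior below are all assumptions made for illustration only.

```python
import math

EPS = 1e-12  # assumed clamp to avoid log(0); not from the paper


def _clamp(p):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def positive_class_loss(posterior, target):
    """Low-granularity feedback: a single scalar, -log p(target)."""
    return -math.log(_clamp(posterior[target]))


def per_class_losses(posterior, target):
    """Higher-granularity feedback (assumed form): one loss per character.

    The target (positive) character is penalized for low probability;
    every other (negative) character is penalized for high probability.
    """
    losses = {}
    for char, p in posterior.items():
        p = _clamp(p)
        if char == target:
            losses[char] = -math.log(p)        # positive class
        else:
            losses[char] = -math.log(1.0 - p)  # negative classes
    return losses


# Hypothetical ASR posterior over a three-character vocabulary.
posterior = {"a": 0.7, "b": 0.2, "c": 0.1}
scalar_feedback = positive_class_loss(posterior, "a")
vector_feedback = per_class_losses(posterior, "a")
```

Under these assumptions, the scalar feedback only says how well the correct character was recognized, while the per-class vector also indicates which competing characters the ASR confused it with.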

Tags:

Speech and singing voice synthesis/conversion/coding

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle