Evaluating Speech–Phoneme Alignment and Its Impact on Neural Text-To-Speech Synthesis

Frank Zalkow (Fraunhofer IIS); Prachi Govalkar (Fraunhofer IIS); Meinard Müller (International Audio Laboratories Erlangen); Emanuel Habets (Fraunhofer IIS); Christian Dittmar (Fraunhofer IIS)

07 Jun 2023

In recent years, the quality of text-to-speech (TTS) synthesis has improved vastly due to deep-learning techniques, with parallel architectures, in particular, providing excellent synthesis quality with fast inference. Training these models usually requires speech recordings, corresponding phoneme-level transcripts, and the temporal alignment of each phoneme to the utterances. Since manually creating such fine-grained alignments requires expert knowledge and is time-consuming, it is common practice to estimate them using automatic speech–phoneme alignment methods. In the literature, either the estimation methods' accuracy or their impact on the TTS system's synthesis quality is evaluated, but usually not both. In this study, we perform experiments with five state-of-the-art speech–phoneme aligners and evaluate their output with objective and subjective measures. As our main result, we show that small alignment errors (below 75 ms) do not decrease the synthesis quality, which implies that the alignment error may not be the crucial factor when choosing an aligner for TTS training.
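To make the notion of an alignment error concrete, the following is a minimal sketch of one way to compare estimated phoneme boundaries against reference boundaries. The abstract does not specify the objective measure used in the study, so the absolute boundary error, the function names, and the example timestamps are all assumptions made for illustration; only the 75 ms threshold comes from the abstract.

```python
from typing import List, Sequence, Tuple


def boundary_errors_ms(reference: Sequence[float],
                       estimated: Sequence[float]) -> List[float]:
    """Absolute error in milliseconds between paired boundary times (seconds).

    Assumes both alignments describe the same phoneme sequence, so that the
    i-th reference boundary corresponds to the i-th estimated boundary.
    """
    if len(reference) != len(estimated):
        raise ValueError("alignments must contain the same number of boundaries")
    return [abs(r - e) * 1000.0 for r, e in zip(reference, estimated)]


def summarize(errors_ms: Sequence[float],
              tolerance_ms: float = 75.0) -> Tuple[float, float]:
    """Return (mean absolute error in ms, fraction of errors below tolerance)."""
    if not errors_ms:
        raise ValueError("no boundaries to evaluate")
    mean_error = sum(errors_ms) / len(errors_ms)
    fraction_below = sum(e < tolerance_ms for e in errors_ms) / len(errors_ms)
    return mean_error, fraction_below


# Hypothetical boundary times in seconds, not taken from the paper.
reference = [0.10, 0.25, 0.40, 0.62]
estimated = [0.12, 0.24, 0.45, 0.72]

errors = boundary_errors_ms(reference, estimated)
mean_error, fraction_below = summarize(errors)
print(f"mean absolute error: {mean_error:.1f} ms")
print(f"boundaries below 75 ms: {fraction_below:.0%}")
```

In this made-up example, three of the four boundaries deviate by less than 75 ms and one deviates by about 100 ms. A summary like this describes alignment accuracy only; whether such errors affect synthesis quality is the separate question the study addresses with its subjective evaluation.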

Tags:

Segmentation, tagging, and parsing

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
