StreamSpeech: Low-Latency Neural Architecture for High-Quality On-Device Speech Synthesis
Georgi S Shopov (IICT-BAS); Stefan Gerdjikov (FMI, Sofia University); Stoyan Mihov (IICT-BAS)
Neural TTS systems have recently demonstrated the ability to synthesize high-quality natural speech. However, the inference latency and real-time factor (RTF) of such systems are still too high for deployment on devices without specialized hardware. In this paper, we describe StreamSpeech, an optimized architecture for a complete TTS system that produces high-quality speech and runs faster than real time with imperceptible latency on resource-constrained devices using a single CPU core. We divide the standard TTS processing pipeline into three phases according to their operating resolution and optimize each phase separately. Our main novel contribution is a lightweight convolutional acoustic model decoder that enables streaming, low-latency speech generation. Experiments show that the resulting complete TTS system achieves 79 ms latency and 0.155 RTF on a low-power notebook x86 CPU, and 276 ms latency and 0.289 RTF on a mid-range mobile ARM CPU, without degrading speech quality.
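For context, RTF is the ratio of synthesis time to the duration of the audio produced, so an RTF of 0.155 means one second of speech is generated in roughly 155 ms. The sketch below illustrates how a causal convolutional decoder of the kind the abstract describes can emit mel-spectrogram frames chunk by chunk: each layer caches its last few input frames so successive chunks match an offline pass exactly. This is a minimal illustration of the streaming mechanism under assumed layer sizes and a PyTorch framing, not the authors' implementation.

```python
# Minimal sketch (assumption, not the paper's code) of a streaming causal
# convolutional decoder: each layer keeps a cache of its last (kernel - 1)
# input frames so that successive chunks can be processed independently
# while producing the same output as an offline pass.
import torch
import torch.nn as nn

class StreamingConvDecoder(nn.Module):
    def __init__(self, in_dim=256, mel_dim=80, hidden=256, kernel=5, layers=4):
        super().__init__()
        self.kernel = kernel
        dims = [in_dim] + [hidden] * layers
        # No built-in padding: the causal left-context comes from the cache.
        self.convs = nn.ModuleList(
            nn.Conv1d(dims[i], dims[i + 1], kernel) for i in range(layers)
        )
        self.proj = nn.Conv1d(hidden, mel_dim, 1)  # per-frame projection to mels
        self.act = nn.ReLU()

    def forward(self, x, caches=None):
        # x: (batch, in_dim, chunk_len) encoder features for one chunk
        if caches is None:  # first chunk: zero left-context for every layer
            caches = [x.new_zeros(x.size(0), c.in_channels, self.kernel - 1)
                      for c in self.convs]
        new_caches = []
        for conv, cache in zip(self.convs, caches):
            x = torch.cat([cache, x], dim=-1)             # prepend cached frames
            new_caches.append(x[..., -(self.kernel - 1):].detach())
            x = self.act(conv(x))                          # output length == chunk_len
        return self.proj(x), new_caches

# Frames are emitted as soon as a chunk of encoder features arrives,
# which is what keeps end-to-end latency low.
dec = StreamingConvDecoder().eval()
caches = None
with torch.no_grad():
    for _ in range(5):                      # five successive 12-frame chunks
        chunk = torch.randn(1, 256, 12)
        mel, caches = dec(chunk, caches)    # mel: (1, 80, 12)
```

Because every convolution is causal and its left-context is carried in a small per-layer cache, the per-chunk cost is constant and no future frames are needed, which is the property that makes low-latency streaming possible on a single CPU core.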