Replacing Human Audio With Synthetic Audio For On-Device Unspoken Punctuation Prediction

Daria Soboleva, Ondrej Skopek, Márius Šajgalík

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:12:29

10 Jun 2021

We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.

Chairs:

Thomas Drugman

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021

Replacing Human Audio With Synthetic Audio For On-Device Unspoken Punctuation Prediction

Daria Soboleva, Ondrej Skopek, Márius Šajgalík

Value-Added Bundle(s) Including this Product

ICASSP 2021 Virtual Conference - Presentation Videos Product Bundle

More Like This

Keynote: Navigating the Transition to Sustainable Energy Solutions in a Power-Hungry World

Panel: Leveraging Technology to Achieve Carbon Neutrality of Buildings and Factories

Panel: Charting the Course for Future-Ready Data Centers in the Era of Sustainability

Join an IEEE Society