Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff

Length: 15:13

04 May 2020

We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct an unseen temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required: unsupervised training on LibriSpeech followed by supervised training with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
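The slice-reconstruction idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mean-pooling "encoder", the single linear projection, the L1 loss, the 40-dimensional features, and all names (predict_slice, reconstruction_loss, context, weights) are assumptions chosen only to show how a target slice is hidden from the model and predicted from the frames on either side of it.

```python
import numpy as np


def predict_slice(feats, t, k, context, weights):
    """Predict frames feats[t:t+k] using only frames outside that slice.

    feats:   (num_frames, dim) array of filterbank-like features.
    context: number of frames used on each side (hypothetical parameter).
    weights: (2 * dim, k * dim) projection, a stand-in for a learned model.
    """
    num_frames, dim = feats.shape
    if t - context < 0 or t + k + context > num_frames:
        raise ValueError("slice plus context must fit inside the utterance")
    past = feats[t - context:t]              # frames strictly before the slice
    future = feats[t + k:t + k + context]    # frames strictly after the slice
    # Assumed stand-in encoder: mean-pool each side, then concatenate.
    summary = np.concatenate([past.mean(axis=0), future.mean(axis=0)])
    return (summary @ weights).reshape(k, dim)


def reconstruction_loss(feats, t, k, context, weights):
    """Mean absolute (L1) error on the hidden slice; the loss type is assumed."""
    pred = predict_slice(feats, t, k, context, weights)
    return float(np.abs(pred - feats[t:t + k]).mean())


# Toy usage with random data in place of real filterbank features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 40))
weights = rng.normal(scale=0.01, size=(80, 4 * 40))
loss = reconstruction_loss(feats, t=50, k=4, context=10, weights=weights)
print(f"L1 reconstruction loss on the hidden slice: {loss:.4f}")
```

In a real system the pooled summary would presumably be replaced by a trained sequence encoder, and its internal states, rather than the reconstruction itself, would serve as the input features for the CTC-based recognizer.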

Tags: sps conference, icassp 2020 virtual conference, May 2020, icassp 2020
