VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

Naoyuki Kanda (Microsoft); Jian Wu (Microsoft); Xiaofei Wang (Microsoft); Zhuo Chen (Microsoft); Jinyu Li (Microsoft); Takuya Yoshioka (Microsoft)

07 Jun 2023

This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two recent, independently developed technologies: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a new t-SOT-based ASR model that generates a serialized multi-talker transcription from the two separated speech signals produced by VarArray. We also propose a pre-training scheme for such an ASR model, in which we simulate VarArray's output signals from monaural single-talker ASR training data. Conversation transcription experiments using the AMI meeting corpus show that the system based on the proposed framework significantly outperforms conventional ones. Our system achieves state-of-the-art word error rates of 13.7% and 15.5% on the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting while retaining the streaming inference capability.
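To make the phrase "serialized multi-talker transcription" concrete, the following is a minimal sketch of the general token-level serialization idea, not the authors' implementation. It assumes up to two overlapping talkers mapped to two virtual channels, tokens ordered by emission time, and a special channel-change token inserted whenever consecutive tokens belong to different channels. The token spelling `<cc>`, the function names, and the input tuple layout are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical spelling of the channel-change token (an assumption).
CC = "<cc>"


def serialize_tsot(tokens: List[Tuple[float, int, str]]) -> List[str]:
    """Merge per-channel tokens into one sequence ordered by emission time.

    Each input item is (emission_time, virtual_channel, token). A CC token
    is inserted whenever the channel differs from the previous token's.
    """
    serialized: List[str] = []
    prev_channel = None
    for _, channel, token in sorted(tokens, key=lambda t: t[0]):
        if prev_channel is not None and channel != prev_channel:
            serialized.append(CC)
        serialized.append(token)
        prev_channel = channel
    return serialized


def deserialize_tsot(serialized: List[str], num_channels: int = 2) -> List[List[str]]:
    """Split a serialized sequence back into per-channel token lists.

    Assumes the first token belongs to channel 0 and that each CC token
    switches to the next channel (cyclically).
    """
    channels: List[List[str]] = [[] for _ in range(num_channels)]
    current = 0
    for token in serialized:
        if token == CC:
            current = (current + 1) % num_channels
        else:
            channels[current].append(token)
    return channels


# Illustrative input: two talkers whose words interleave in time.
example = [
    (0.0, 0, "hello"), (0.4, 0, "how"), (0.5, 1, "hi"),
    (0.7, 0, "are"), (0.9, 1, "there"), (1.1, 0, "you"),
]
stream = serialize_tsot(example)
recovered = deserialize_tsot(stream)
```

Because the merged sequence is ordered by time, a model trained on such targets can emit tokens as the audio arrives, which is consistent with the streaming property described above. The round trip only recovers the original channels when the earliest token is on channel 0, as assumed here.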

Tags:

New algorithms and approaches for speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
