End-To-End Multi-Channel Transformer For Speech Recognition

Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:13:46

08 Jun 2021

Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship "within" and "between" channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that in a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as the super-directive and neural beamformers cascaded with the transformers.

Chairs:

Yangyang Shi

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021

End-To-End Multi-Channel Transformer For Speech Recognition

Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann

Value-Added Bundle(s) Including this Product

ICASSP 2021 Virtual Conference - Presentation Videos Product Bundle

More Like This

Keynote: Navigating the Transition to Sustainable Energy Solutions in a Power-Hungry World

Panel: Leveraging Technology to Achieve Carbon Neutrality of Buildings and Factories

Panel: Charting the Course for Future-Ready Data Centers in the Era of Sustainability

Join an IEEE Society