LEARNING-BASED PERSONAL SPEECH ENHANCEMENT FOR TELECONFERENCING BY EXPLOITING SPATIAL-SPECTRAL FEATURES

Yicheng Hsu, Yonghan Lee, Mingsian Bai

SPS

Length: 00:14:14

10 May 2022

Teleconferencing has become essential during the COVID-19 pandemic. However, in real-world applications, speech quality can be degraded by background interference, noise, or reverberation. To address this problem, the target speech can be extracted from the mixture signals with the aid of the user's vocal features. The system proposed in this study uses several features, including speaker embeddings derived from user enrollment and a novel long-short-term spatial coherence (LSTSC) feature that reflects target speaker activity. As a learning-based approach, a target speech sifting network is employed to extract the target signal. The network trained with LSTSC is robust to the microphone array geometry and the number of microphones. The proposed enhancement system was compared with a baseline system that uses speaker embeddings and the interchannel phase difference. The results show that the proposed system outperforms the baseline in both enhancement performance and robustness.
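The two kinds of spatial feature named in the abstract can be illustrated with a minimal NumPy sketch. It computes the interchannel phase difference (IPD) used by the baseline and a generic recursively smoothed coherence between two microphone channels. This is not the paper's LSTSC definition, which the abstract does not give; the function names, frame sizes, and smoothing constant `alpha` are all assumptions made for illustration.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Hann-windowed STFT of a 1-D signal; returns an array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def ipd(X_ref, X_other):
    """Interchannel phase difference in radians, per time-frequency bin."""
    return np.angle(X_other * np.conj(X_ref))


def smoothed_coherence(X_ref, X_other, alpha=0.9, eps=1e-12):
    """Complex coherence from recursively averaged cross- and auto-spectra.

    alpha close to 1 gives a long memory; a smaller alpha gives a short one.
    (Illustrative only; not the LSTSC feature from the paper.)
    """
    n_bins = X_ref.shape[1]
    cross = np.zeros(n_bins, dtype=complex)
    p_ref = np.zeros(n_bins)
    p_oth = np.zeros(n_bins)
    out = np.empty(X_ref.shape, dtype=complex)
    for t in range(X_ref.shape[0]):
        cross = alpha * cross + (1 - alpha) * X_other[t] * np.conj(X_ref[t])
        p_ref = alpha * p_ref + (1 - alpha) * np.abs(X_ref[t]) ** 2
        p_oth = alpha * p_oth + (1 - alpha) * np.abs(X_other[t]) ** 2
        out[t] = cross / np.sqrt(p_ref * p_oth + eps)
    return out
```

Calling `smoothed_coherence` once with a small `alpha` and once with a large one yields a short-memory and a long-memory estimate of the same quantity. Coherence is normalized by the channel powers, so it depends less on absolute signal level than raw spectra do; how the paper combines such estimates into LSTSC is not stated in this abstract.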

Tags:

target speech enhancement

speaker embedding

convolutional recurrent neural network

spatial coherence analysis

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
