An Investigation Into The Multi-Channel Time Domain Speaker Extraction Network
Catalin Zorila, Mohan Li, Rama Doddipatla
This paper presents an investigation into the effectiveness of spatial features for improving time-domain speaker extraction systems. A two-dimensional Convolutional Neural Network (CNN) based encoder is proposed to capture the spatial information within the multichannel input, which is then combined with the spectral features of a single-channel extraction network. Two variants of target speaker extraction were tested: one that employs a pre-trained i-vector system to compute the speaker embedding (System A), and one that employs a jointly trained neural network to extract the embedding directly from time-domain enrolment signals (System B). The evaluation was performed on the spatialized WSJ0-2mix dataset using the Signal-to-Distortion Ratio (SDR) metric and automatic speech recognition (ASR) accuracy. In the anechoic condition, absolute SDR gains of more than 10 dB and 7 dB were achieved when the 2-D CNN spatial encoder features were included with Systems A and B, respectively. The performance gains in reverberation were lower; however, we demonstrated that retraining the systems with dereverberation preprocessing can significantly boost both target speaker extraction and ASR performance.
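To make the fusion idea concrete, below is a minimal PyTorch sketch of a 2-D CNN spatial encoder whose frame-level output is concatenated with the features of a single-channel spectral encoder. This is an illustrative assumption, not the authors' implementation: the class name `SpatialEncoder2D`, the layer shapes, and the `spectral_encoder` placeholder are all hypothetical.

```python
import torch
import torch.nn as nn


class SpatialEncoder2D(nn.Module):
    """Hypothetical 2-D CNN spatial encoder: treats the (mics x samples)
    multichannel waveform as a 2-D map and emits frame-level spatial
    features aligned with a time-domain spectral encoder's output."""

    def __init__(self, n_mics=2, feat_dim=256, kernel=(2, 16), stride=(1, 8)):
        super().__init__()
        # Single-input-plane Conv2d over (mic, time); a kernel height equal
        # to n_mics collapses the microphone axis. Sizes are illustrative.
        self.conv = nn.Conv2d(1, feat_dim, kernel_size=kernel, stride=stride)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, n_mics, samples) raw multichannel waveform
        x = x.unsqueeze(1)           # (batch, 1, n_mics, samples)
        f = self.act(self.conv(x))   # (batch, feat_dim, 1, frames)
        return f.squeeze(2)          # (batch, feat_dim, frames)


# Fusion sketch: concatenate spatial and spectral features along the
# feature axis before the separator network. `spectral_encoder` stands in
# for any single-channel time-domain encoder (e.g. Conv-TasNet style).
# spectral = spectral_encoder(x[:, 0])           # (batch, feat_dim, frames)
# spatial  = SpatialEncoder2D()(x)               # (batch, feat_dim, frames)
# fused    = torch.cat([spectral, spatial], dim=1)
```

Under these assumptions, the separator sees both what each frame sounds like (spectral features from the reference channel) and where it comes from (inter-channel cues captured by the 2-D convolution across microphones), which is the combination the paper credits for the SDR gains.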