Distance-based Weight Transfer for Fine-tuning from Near-field to Far-field Speaker Verification

Li Zhang (Northwestern Polytechnical University); Qing Wang (Northwestern Polytechnical University); Hongji Wang; Yue Li (Northwestern Polytechnical University); Wei Rao (Tencent); Yannan Wang (Tencent); Lei Xie (Northwestern Polytechnical University)

07 Jun 2023

The scarcity of labeled far-field speech constrains the training of high-performing far-field speaker verification systems. In general, fine-tuning a model pre-trained on large-scale near-field speech with a small amount of far-field speech substantially outperforms training from scratch. However, vanilla fine-tuning suffers from two limitations: catastrophic forgetting and overfitting. In this paper, we propose a weight transfer regularization (WTR) loss that constrains the distance between the weights of the pre-trained model and those of the fine-tuned model. With the WTR loss, fine-tuning retains the discriminative ability acquired from the large-scale near-field speech and thus avoids catastrophic forgetting. Meanwhile, an analysis based on PAC-Bayes generalization theory shows that the WTR loss gives the fine-tuned model a tighter generalization bound, mitigating the overfitting problem. Moreover, we explore three norm distances for weight transfer: L1-norm, L2-norm, and Max-norm. We evaluate the effectiveness of the WTR loss on the VoxCeleb (pre-training) and FFSVC (fine-tuning) datasets. Experimental results show that the distance-based weight transfer fine-tuning strategy significantly outperforms vanilla fine-tuning and other competitive domain adaptation methods.
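
As a rough illustration (this is not the authors' released code), the WTR term can be sketched in PyTorch as a norm distance between the current model weights and a frozen copy of the pre-trained weights; the function name wtr_loss, the norm argument, and the weighting factor lam below are assumptions made for illustration, not details taken from the paper.

    # Minimal sketch, assuming a PyTorch model fine-tuned from a saved
    # pre-trained checkpoint. The WTR term penalizes how far the
    # fine-tuned weights drift from the pre-trained weights.
    import torch

    def wtr_loss(model, pretrained_state, norm="l2"):
        """Sum of per-parameter distances to the pre-trained weights."""
        total = torch.zeros((), device=next(model.parameters()).device)
        for name, param in model.named_parameters():
            ref = pretrained_state[name].to(param.device)
            diff = param - ref
            if norm == "l1":    # L1-norm distance
                total = total + diff.abs().sum()
            elif norm == "l2":  # squared L2-norm distance
                total = total + diff.pow(2).sum()
            elif norm == "max": # Max-norm (largest absolute deviation)
                total = total + diff.abs().max()
        return total

    # During fine-tuning on far-field speech, the regularizer would be
    # added to the speaker classification objective with some weight lam:
    #   loss = speaker_loss + lam * wtr_loss(model, pretrained_state, "l2")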
