MULTI-TASK LEARNING IMPROVES SYNTHETIC SPEECH DETECTION
Yichuan Mo, Shilin Wang
SPS
With the development of deep learning, synthetic speech has become increasingly realistic and increasingly capable of spoofing Automatic Speaker Verification (ASV) systems. Many algorithms have been proposed to detect this malicious attack, based on mining more effective hand-crafted features or designing more powerful networks. In this paper, observing that deepening a network impairs its performance in detecting unknown attacks, we cast synthetic speech detection as an out-of-distribution (OOD) generalization problem and enhance network robustness through multi-task learning. In our system, three auxiliary tasks assist synthetic speech detection: bonafide speech reconstruction, spoofing voice conversion, and speaker classification. Experimental results show that our approach can be applied to multiple architectures and significantly improves performance on both known attacks (development set) and unknown attacks (evaluation set). In addition, our best-performing network is competitive with recent state-of-the-art (SOTA) systems, demonstrating the potential of multi-task learning for synthetic speech detection.
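The multi-task setup described above — a spoof-detection objective trained jointly with auxiliary objectives over a shared encoder — can be sketched as a weighted sum of per-task losses. The layer sizes, task weights, and head definitions below are illustrative assumptions, not the authors' exact configuration, and the spoofing voice conversion task is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskSpoofDetector(nn.Module):
    """Shared encoder with a detection head plus two auxiliary heads.
    Dimensions here are illustrative, not the paper's configuration."""

    def __init__(self, feat_dim=60, hidden=128, n_speakers=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.detect_head = nn.Linear(hidden, 2)          # bonafide vs. spoof
        self.recon_head = nn.Linear(hidden, feat_dim)    # bonafide speech reconstruction
        self.speaker_head = nn.Linear(hidden, n_speakers)  # speaker classification

    def forward(self, x):
        h = self.encoder(x)
        return self.detect_head(h), self.recon_head(h), self.speaker_head(h)


def multitask_loss(model, x, y_spoof, y_speaker, w_recon=0.5, w_spk=0.5):
    """Detection loss plus weighted auxiliary losses; weights are assumed."""
    det_logits, recon, spk_logits = model(x)
    loss = F.cross_entropy(det_logits, y_spoof)          # main task
    loss = loss + w_recon * F.mse_loss(recon, x)         # reconstruction task
    loss = loss + w_spk * F.cross_entropy(spk_logits, y_speaker)  # speaker task
    return loss
```

At inference time only the detection head is used; the auxiliary heads serve purely to regularize the shared encoder during training, which is what the abstract credits for the improved generalization to unknown attacks.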