CONTENT-INSENSITIVE DYNAMIC LIP FEATURE EXTRACTION FOR VISUAL SPEAKER AUTHENTICATION AGAINST DEEPFAKE ATTACKS
Zihao Guo (Shanghai Jiao Tong University); Shilin Wang (SEIEE, Shanghai Jiao Tong University)
Recent research has shown that lip-based speaker authentication systems can achieve good authentication performance. However, with emerging deepfake technology, attackers can produce high-fidelity talking videos of a user, posing a great threat to these systems. Confronted with this threat, we propose a new deep neural network for lip-based visual speaker authentication against both human imposters and deepfake attacks. A dynamic enhanced block with a context modeling scheme is designed to capture a user’s unique talking habit by learning from his/her lip movements. Meanwhile, a cross-modality content-guided loss is designed to help extract discriminative features when learning from the different lip movements of a user uttering different content, making the proposed method insensitive to content variation. Experiments on the GRID dataset show that the proposed method not only outperforms three state-of-the-art methods but also simplifies the training process and reduces the training cost.
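The abstract does not give the exact formulation of the cross-modality content-guided loss. Below is a minimal, illustrative sketch of one way such a loss could be structured: a contrastive-style term that pulls together identity embeddings of the same speaker uttering different content and pushes apart different speakers, plus a term that discourages the visual identity features from encoding utterance content by decorrelating them from content features of another modality. All names and hyperparameters (identity_emb, content_emb, margin, lambda_content) are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of a cross-modality content-guided loss (not the
# authors' formulation); assumes PyTorch and batched embeddings.
import torch
import torch.nn.functional as F

def content_guided_loss(identity_emb, content_emb, speaker_labels,
                        margin=0.5, lambda_content=0.1):
    """identity_emb: (B, D) lip-motion identity features from the visual branch.
    content_emb:   (B, D) content features from another modality (e.g. audio/text).
    speaker_labels: (B,) integer speaker IDs; batch should mix speakers/utterances."""
    # Pairwise cosine similarity between identity embeddings in the batch.
    sim = F.cosine_similarity(identity_emb.unsqueeze(1),
                              identity_emb.unsqueeze(0), dim=-1)
    same = speaker_labels.unsqueeze(0) == speaker_labels.unsqueeze(1)
    eye = torch.eye(len(speaker_labels), dtype=torch.bool,
                    device=identity_emb.device)

    # Pull together samples of the same speaker uttering different content;
    # push apart samples from different speakers (margin-based term).
    pos = (1.0 - sim)[same & ~eye].mean()
    neg = F.relu(sim - margin)[~same].mean()

    # Content-guidance term: penalize correlation between identity features
    # and cross-modal content features so identity stays content-insensitive.
    content_leak = F.cosine_similarity(identity_emb, content_emb, dim=-1).abs().mean()

    return pos + neg + lambda_content * content_leak
```

In this sketch the content branch acts only as a guide during training; at test time authentication would rely on the visual identity embedding alone, which is consistent with the content-insensitivity goal stated in the abstract.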