Towards Generalizable Deepfake Face Forgery Detection With Semi-Supervised Learning and Knowledge Distillation
Yuzhen Lin, Han Chen, Bin Li, Junqiang Wu
SPS
Length: 00:11:44
Driven by the appeal of models applicable in the real world, we investigate how temporal and spatial occlusion affect sign language recognition. Using only a crop of the hands and pose flow, we maintain accuracy comparable to an I3D baseline on the WLASL dataset with a video transformer network (VTN), implying that hand crops may contain enough information for accurate prediction. Moreover, we find that a crop of only the right hand provides enough data to train an accurate model, with accuracy 0.2% below the baseline for AUTSL and 4.7% below across all WLASL datasets. Sampling every fifth frame of a video achieves results comparable to the baseline, with 8-frame sequences performing better for AUTSL (0.4% below baseline) and 16-frame sequences performing better for WLASL (0.2% below for WLASL 100 and 300). Our results indicate the feasibility of using less information for sign language recognition; however, more research is needed to apply these findings in real-world scenarios.