Length: 00:11:44
06 Oct 2022

Driven by the appeal of real-world applicable models, we investigate how temporal and spatial occlusion affect sign language recognition. Using only a crop of the hands together with pose flow, a video transformer (VTN) maintains accuracy comparable to an I3D baseline on the WLASL dataset, implying that hand crops may contain enough information for accurate prediction. Moreover, we find that a crop of only the right hand provides enough data to train an accurate model, with accuracy 0.2% below the baseline for AUTSL and 4.7% below across all WLASL subsets. Sampling every fifth frame of a video achieves results comparable to the baseline, with 8-frame sequences performing better for AUTSL (0.4% below baseline) and 16-frame sequences performing better for WLASL (0.2% below for WLASL 100 and 300). Our results indicate the feasibility of using less information for sign language recognition; however, more research is needed to apply these findings in real-world scenarios.
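As a rough illustration of the temporal-occlusion setup described above, the sketch below subsamples a clip at a fixed stride (every fifth frame) and then draws a fixed-length sequence of 8 or 16 frames for the model input. The function name, the even-spacing selection, and the last-frame padding fallback are assumptions for illustration; the abstract does not specify the authors' exact sampling code.

```python
import numpy as np

def subsample_frames(video: np.ndarray, stride: int = 5, seq_len: int = 16) -> np.ndarray:
    """Keep every `stride`-th frame, then pick `seq_len` frames evenly.

    `video` is a (T, H, W, C) array. This is a minimal sketch of the
    strided sampling the abstract describes, not the authors' pipeline.
    """
    strided = video[::stride]  # temporal occlusion: drop 4 of every 5 frames
    if len(strided) >= seq_len:
        # evenly spaced indices over the remaining frames
        idx = np.linspace(0, len(strided) - 1, seq_len).round().astype(int)
        return strided[idx]
    # clip too short after striding: pad by repeating the last frame
    pad = np.repeat(strided[-1:], seq_len - len(strided), axis=0)
    return np.concatenate([strided, pad], axis=0)

# Example: a dummy 120-frame RGB clip reduced to a 16-frame input sequence
clip = np.zeros((120, 224, 224, 3), dtype=np.uint8)
print(subsample_frames(clip, stride=5, seq_len=16).shape)  # (16, 224, 224, 3)
```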
