CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Lecture 10 Oct 2023

Video anomaly detection (VAD) -- commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature -- is a challenging problem in video surveillance where the frames of anomaly need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in the novel technique. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and ViT feature. The extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA

Tags:

video anomaly detection

temporal self-attention

weakly supervised

multimodal model

subtlety

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le

More Like This

USURP: UNIVERSAL SINGLE-SOURCE ADVERSARIAL PERTURBATIONS ON MULTIMODAL EMOTION RECOGNITION

MIXED IN TIME AND MODALITY: CURSE OR BLESSING? CROSS-INSTANCE DATA AUGMENTATION FOR WEAKLY SUPERVISED MULTIMODAL TEMPORAL FUSION

IMPORTANCE SAMPLING CAMS FOR WEAKLY-SUPERVISED SEGMENTATION

Join an IEEE Society