AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Length: 00:05:35

13 May 2022

The rapidly evolving field of sound classification has benefited greatly from methods developed in other domains. The current trend is to fuse domain-specific tasks and approaches, which has given the community strong new models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Using the AudioSet dataset, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, enabling it to perform multimodal classification while retaining CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results on the Environmental Sound Classification (ESC) task, outperforming other approaches with accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. It also sets new baselines for zero-shot ESC on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
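As a rough illustration of what zero-shot classification in a shared embedding space can look like, the sketch below scores one audio embedding against text embeddings of candidate labels using cosine similarity and a softmax. It is not the published AudioCLIP code: the function names, the toy three-dimensional embeddings, and the `scale` value are all hypothetical, and in practice the embeddings would come from the trained audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, label_embeddings, scale=100.0):
    """Pick the label whose text embedding is closest to the audio embedding.

    `scale` is an assumed temperature-like factor applied to the
    similarities before the softmax; its value here is illustrative.
    """
    labels = list(label_embeddings)
    logits = [
        scale * cosine_similarity(audio_embedding, label_embeddings[label])
        for label in labels
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy embeddings standing in for encoder outputs (hypothetical values).
toy_label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 1.0, 0.2],
    "rain": [0.1, 0.0, 1.0],
}
toy_audio_embedding = [0.8, 0.2, 0.1]

predicted, probabilities = zero_shot_classify(
    toy_audio_embedding, toy_label_embeddings
)
print(predicted)
```

Because the labels enter only as text embeddings, new classes can be added by embedding new label strings, without retraining a classification head. That is the general idea behind zero-shot use of a joint embedding model.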

Tags: multimodal, zero-shot, classification, audio

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
