ADI17: A Fine-Grained Arabic Dialect Identification Dataset

Ahmed Ali, Younes Samih, Hamdy Mubarak, Suwon Shon, James Glass

Length: 11:39

04 May 2020

In this paper, we describe a method for collecting dialectal speech from YouTube videos to create a large-scale Dialect Identification (DID) dataset. Using this method, we collected dialectal Arabic from known YouTube channels in 17 Arabic-speaking countries in the Middle East and North Africa. After a refinement process, a total of 3,000 hours of speech was available for training DID systems, with an additional 57 hours of speech for development and testing. For detailed evaluation, the DID data was divided into three sub-categories based on segment duration: short (less than 5s), medium (5–20s), and long (over 20s). We compare state-of-the-art DID techniques on these data and also analyze a DID system trained on them. Since the training and test data share the same channel domain, we also used the Multi-Genre Broadcast 3 (MGB-3) test set to evaluate under a domain-mismatched condition.
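
As an illustration of the duration-based split described in the abstract, the following is a minimal sketch of how segments might be assigned to the short, medium, and long categories. The function names `duration_bucket` and `bucket_counts` are hypothetical, and the handling of the exact 5s and 20s boundaries (both placed in "medium" here) is an assumption, since the abstract does not specify it. This is not the authors' code.

```python
from collections import Counter
from typing import Iterable


def duration_bucket(seconds: float) -> str:
    """Map a segment duration in seconds to a duration category.

    Assumed boundaries: short < 5s, medium 5s to 20s inclusive, long > 20s.
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5.0:
        return "short"
    if seconds <= 20.0:
        return "medium"
    return "long"


def bucket_counts(durations: Iterable[float]) -> Counter:
    """Count how many segments fall into each duration category."""
    return Counter(duration_bucket(d) for d in durations)


if __name__ == "__main__":
    # Hypothetical segment durations in seconds, for illustration only.
    example = [1.2, 7.5, 30.0, 12.0]
    print(bucket_counts(example))
```

Reporting results per category in this way makes it possible to see how identification accuracy changes with the amount of speech available in each test segment.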

Tags: SPS conference, ICASSP 2020 Virtual Conference, ICASSP 2020, May 2020
