Towards Unsupervised Learning Of Speech Features In The Wild
Morgane Riviere, Emmanuel Dupoux
SPS
Recent work on unsupervised contrastive learning of speech representations has shown promising results, but so far it has only been tested on clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three problems that may hinder unsupervised learning in the wild: (i) the presence of non-speech data, (ii) noisy or low-quality speech data, and (iii) imbalance in the speaker distribution. We show that on the Libri-light train set, which is itself a clean speech-only dataset, these problems combined can have a performance cost of up to 30% relative on the ABX score. The first two problems can be alleviated by data filtering: voice activity detection selects the speech parts inside a file, and the perplexity of a model trained on clean data helps discard entire files. The third problem can be alleviated by learning a speaker embedding in the predictive segment of the model. Together, these techniques build more robust speech features that can be transferred to an ASR task in the low-resource setting.
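The first filtering step described above, selecting the speech parts inside a file with voice activity detection, can be illustrated with a minimal sketch. The paper does not specify a particular VAD; the simple frame-energy detector below (function name, frame sizes, and threshold are illustrative assumptions, not the authors' method) only shows the general idea of keeping speech frames and dropping the rest before training.

```python
import numpy as np

def energy_vad(wav, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Toy VAD: flag a frame as speech (True) when its log-energy is
    within `threshold_db` dB of the loudest frame in the file.
    This is an illustrative stand-in for a real VAD system."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(wav) - frame) // hop)
    # Per-frame energy, then convert to dB (epsilon avoids log of zero).
    energies = np.array(
        [np.sum(wav[i * hop : i * hop + frame] ** 2) for i in range(n_frames)]
    )
    log_e = 10.0 * np.log10(energies + 1e-10)
    return log_e > (log_e.max() + threshold_db)
```

In a preprocessing pipeline, only the frames flagged `True` would be concatenated into the training stream; in practice a dedicated VAD model with smoothing over frame decisions would replace this energy heuristic.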