  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 0:14:42
19 Jan 2021

Recent work on unsupervised contrastive learning of speech representations has shown promising results, but so far it has only been tested on clean, curated speech datasets. Can it also be used with unprepared audio data “in the wild”? Here, we explore three problems that may hinder unsupervised learning in the wild: (i) the presence of non-speech data, (ii) noisy or low-quality speech data, and (iii) imbalance in the speaker distribution. We show that on the Libri-light train set, which is itself a clean, speech-only dataset, these problems combined can incur a performance cost of up to 30% relative on the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting the speech parts inside a file, and the perplexity of a model trained on clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive segment of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low-resource setting.
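To make the two filtering steps above concrete, here is a minimal sketch of (a) selecting speech parts inside a file with a voice activity detector, and (b) discarding whole files whose perplexity under a model trained on clean data is too high. The energy-based VAD and the perplexity threshold (`ppl_threshold`) are simplified stand-ins for illustration, not the pipeline actually used in the paper.

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-35.0):
    """Toy VAD: mark a frame as speech when its log-energy is within
    `threshold_db` of the loudest frame in the file."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return (energy - energy.max()) > threshold_db  # boolean speech mask

def keep_file(avg_log_prob, ppl_threshold=200.0):
    """File-level filter: reject a file whose perplexity under a
    clean-speech model exceeds the threshold (a hypothetical value)."""
    perplexity = float(np.exp(-avg_log_prob))
    return perplexity <= ppl_threshold

# Toy input: near-silence followed by a 440 Hz tone standing in for speech.
rng = np.random.default_rng(0)
sig = np.concatenate([
    0.001 * rng.standard_normal(4000),                   # near-silence
    np.sin(2 * np.pi * 440 * np.arange(4000) / 16000),   # "speech"
])
mask = energy_vad(sig)  # first 10 frames False, last 10 frames True
```

A real system would replace `energy_vad` with a trained VAD model and compute `avg_log_prob` from a language or acoustic model fitted on clean data; the structure of the two-stage filter stays the same.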
