  • SPS Members: Free
  • IEEE Members: $11.00
  • Non-members: $15.00
  • Length: 00:14:03
13 May 2022

Pitch is a critical cue in human speech perception. Although pitch-tracking technology for single-talker speech succeeds in many applications, extracting pitch information from mixtures remains a challenging problem. Inspired by the motor theory of speech perception, a novel multi-speaker pitch tracking approach is proposed in this work, based on an embodied self-supervised learning method (EMSSL-Pitch). The conceptual idea is that speech is produced through an underlying physical process (i.e., the human vocal tract) given the articulatory parameters (articulatory-to-acoustic), while speech perception is the inverse process, aiming to recover the intended articulatory gestures of the speaker from acoustic signals (acoustic-to-articulatory). Pitch is part of the articulatory parameters, corresponding to the vibration frequency of the vocal folds. The acoustic-to-articulatory inversion is modeled in a self-supervised manner: an inference network is learned by iteratively sampling and training. The representations learned by this inference network have explicit physical meanings, i.e., articulatory parameters from which pitch information can be further extracted. Experiments on the GRID database show that EMSSL-Pitch achieves performance comparable to supervised baselines and generalizes to unseen speakers.
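To make the iterative sampling-and-training idea concrete, here is a minimal PyTorch sketch of an analysis-by-synthesis loop of the kind the abstract describes: articulatory parameters are sampled, passed through a forward (articulatory-to-acoustic) model to produce acoustics, and an inference network is trained to invert them. The synthesizer, dimensions, and the index used for pitch are hypothetical placeholders for illustration, not the authors' actual model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_ARTIC, N_FEAT = 12, 64  # hypothetical articulatory / acoustic dimensions

# Placeholder for the physical articulatory-to-acoustic process (the
# "vocal tract"); a fixed random linear map so the sketch actually runs.
W_SYNTH = torch.randn(N_ARTIC, N_FEAT)

def synthesize(artic: torch.Tensor) -> torch.Tensor:
    """Forward model: articulatory parameters -> acoustic features."""
    return artic @ W_SYNTH

# Inference network: acoustic-to-articulatory inversion.
inference_net = nn.Sequential(
    nn.Linear(N_FEAT, 256), nn.ReLU(),
    nn.Linear(256, N_ARTIC),
)
opt = torch.optim.Adam(inference_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for round_idx in range(5):  # iterative sampling-and-training rounds
    # 1) Sample articulatory parameters; component 0 stands in for pitch.
    artic = torch.rand(256, N_ARTIC)
    # 2) Generate the corresponding acoustics with the forward model.
    feats = synthesize(artic)
    # 3) Train the inference network to recover the sampled parameters.
    for _ in range(100):
        opt.zero_grad()
        loss = loss_fn(inference_net(feats), artic)
        loss.backward()
        opt.step()
    print(f"round {round_idx}: inversion loss {loss.item():.4f}")

# Pitch is then read off as one component of the inferred parameters.
pitch_estimate = inference_net(feats)[:, 0]
```

In the paper's setting the forward model would be a physically meaningful articulatory synthesizer and the inputs would be speech mixtures; this toy linear map only illustrates the structure of the loop.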
