MULTI-SPEAKER PITCH TRACKING VIA EMBODIED SELF-SUPERVISED LEARNING

Xiang Li, Yifan Sun, Xihong Wu, Jing Chen

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:14:03

13 May 2022

Pitch is a critical cue in human speech perception. Although the technology of tracking pitch in single-talker speech succeeds in many applications, it?s still a challenging problem to extract pitch information from mixtures. Inspired by the motor theory of speech perception, a novel multi-speaker pitch tracking approach is proposed in this work, based on an embodied self-supervised learning method (EMSSL-Pitch). The conceptual idea is that speech is produced through an underlying physical process (i.e., human vocal tract) given the articulatory parameters (articulatory-to-acoustic), while speech perception is like the inverse process, aiming at perceiving the intended articulatory gestures of the speaker from acoustic signals (acoustic-to-articulatory). Pitch value is part of the articulatory parameters, corresponding to the vibration frequency of vocal folders. The acoustic-to-articulatory inversion is modeled in a self-supervised manner to learn an inference network by iteratively sampling and training. The learned representations from this inference network can have explicit physical meanings, i.e., articulatory parameters where pitch information can be further extracted. Experiments on GRID database show that EMSSL-Pitch can achieve a reachable performance compared with supervised baselines and be generalized to unseen speakers.

Tags:

speech perception

self-supervised learning

speech production

multi-pitch tracking

MULTI-SPEAKER PITCH TRACKING VIA EMBODIED SELF-SUPERVISED LEARNING

Xiang Li, Yifan Sun, Xihong Wu, Jing Chen

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

Short Course Bundle: ICIP 2023 COURSE 2: Short Course: Unboxing Advancements in Biomedical Image Processing (Parts 1-4)

The Changing Landscape of Speech Foundation Models

Slides: The Changing Landscape of Speech Foundation Models

Join an IEEE Society