Pitch Estimation Via Self-Supervision
Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, Mihajlo Velimirovic
We present a method to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. In contrast to existing methods, our neural network can be trained entirely on unlabeled data, using self-supervision. A small amount of labeled data is needed only to map the network outputs to absolute pitch values. The key is the observation that if two training examples are created from the same audio clip by pitch-shifting it by different amounts, the difference between their correct outputs is known exactly, even though the actual pitch of the original clip is not. Somewhat surprisingly, this idea combined with an auxiliary reconstruction loss suffices to train a pitch estimation model. Our results show that our method achieves accuracy comparable to fully supervised models on monophonic audio, without the need for large labeled datasets. In addition, we are able to train a voicing detection output in the same model, again without using any labels.
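The relative-pitch idea lends itself to a compact illustration. Below is a minimal sketch of such a self-supervised objective, not the authors' implementation: the model stub `pitch_model`, the shift range, the per-semitone scale `SEMITONE_SCALE`, and the Huber-style penalty are all illustrative assumptions, and the auxiliary reconstruction loss mentioned in the abstract is omitted.

```python
import numpy as np
import librosa

# Assumed mapping between model output units and semitones (hypothetical).
SEMITONE_SCALE = 1.0 / 64.0

def pitch_model(audio: np.ndarray, sr: int = 16000) -> float:
    """Placeholder for the network's scalar pitch head. In practice this
    would be a trained model; here a crude spectral-centroid proxy is used
    only so the sketch runs end to end."""
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr).mean()
    return float(np.log2(centroid + 1.0) / 16.0)

def relative_pitch_loss(clip: np.ndarray, sr: int = 16000) -> float:
    """Create two pitch-shifted views of one clip and penalize deviation of
    the predicted pitch difference from the known shift difference."""
    k1, k2 = np.random.randint(-12, 13, size=2)  # shifts in semitones
    view1 = librosa.effects.pitch_shift(clip, sr=sr, n_steps=int(k1))
    view2 = librosa.effects.pitch_shift(clip, sr=sr, n_steps=int(k2))
    p1 = pitch_model(view1, sr)
    p2 = pitch_model(view2, sr)
    target = SEMITONE_SCALE * (k1 - k2)  # known difference, no label needed
    err = (p1 - p2) - target
    # Huber-style penalty (assumed) keeps gradients bounded for large errors.
    delta = 0.25
    if abs(err) <= delta:
        return float(0.5 * err ** 2)
    return float(delta * (abs(err) - 0.5 * delta))

if __name__ == "__main__":
    clip = np.random.randn(16000).astype(np.float32)  # 1 s stand-in signal
    print(relative_pitch_loss(clip))
```

Because the loss depends only on the difference `k1 - k2`, no absolute pitch label ever enters training; the small labeled set mentioned above is needed only afterwards, to calibrate the model outputs to absolute pitch values.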