Unsupervised Acoustic-To-Articulatory Inversion Neural Network Learning Based On Deterministic Policy Gradient
Hayato Shibata, Mingxin Zhang, Takahiro Shinozaki
SPS
Length: 0:14:15
This paper presents an unsupervised method for training deep neural networks that perform acoustic-to-articulatory inversion on arbitrary utterances. Conventional unsupervised acoustic-to-articulatory inversion methods are based on the analysis-by-synthesis approach and non-linear optimization algorithms. One limitation is that they require time-consuming iterative optimization to obtain the articulatory parameters for each target speech segment. A neural network that has learned the relationship between acoustic features and articulatory parameters can produce these parameters without iterative optimization. However, conventional neural network methods require supervised learning with paired acoustic and articulatory samples. We propose an unsupervised learning framework, based on a hybrid auto-encoder, for acoustic-to-articulatory inversion neural networks that can capture context information. The key to the framework is making the training effective. To that end, we investigate several reinforcement learning algorithms and show the usefulness of the deterministic policy gradient. Experimental results demonstrate that the proposed method can infer articulatory parameters not only for training set segments but also for unseen utterances. The average reconstruction errors on open test samples are similar to or even lower than those of the conventional method, which directly optimizes the articulatory parameters in a closed condition.
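The idea of training an inversion network without paired data can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a one-dimensional "acoustic" input, a linear black-box function `synthesize` standing in for a speech synthesizer, a linear actor playing the role of the inversion network, and a critic that is linear in quadratic features. All names, learning rates, and the clamping range are hypothetical choices made for illustration. The actor never sees the correct articulatory value; it is updated only through the critic's gradient with respect to the action, which is the core step of a deterministic policy gradient.

```python
import random


def synthesize(a):
    """Toy stand-in for a synthesizer; the learner treats it as a black box."""
    return 2.0 * a + 0.5


def features(x, a):
    """Quadratic features for the critic Q(x, a) = theta . features(x, a)."""
    return [1.0, x, a, x * x, x * a, a * a]


def clamp(v, lo=-1.0, hi=1.0):
    """Keep actor parameters in a bounded box (an assumption for stability)."""
    return max(lo, min(hi, v))


def train(steps=200_000, seed=0, noise=0.8, lr_critic=0.02, lr_actor=0.005):
    rng = random.Random(seed)
    w, b = 0.0, 0.0        # actor (inversion network): a = w * x + b
    theta = [0.0] * 6      # critic weights
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)                  # "acoustic" observation
        mu = w * x + b                              # deterministic action
        a = mu + rng.uniform(-noise, noise)         # exploratory action
        reward = -(synthesize(a) - x) ** 2          # negative reconstruction error

        # Critic: one SGD step regressing Q(x, a) onto the observed reward.
        phi = features(x, a)
        err = reward - sum(t * p for t, p in zip(theta, phi))
        theta = [t + lr_critic * err * p for t, p in zip(theta, phi)]

        # Actor: ascend dQ/da evaluated at the deterministic action mu.
        dq_da = theta[2] + theta[4] * x + 2.0 * theta[5] * mu
        w = clamp(w + lr_actor * dq_da * x)
        b = clamp(b + lr_actor * dq_da)
    return w, b


def reconstruction_error(w, b, n=201):
    """Mean squared error between resynthesized and input values on a grid."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return sum((synthesize(w * x + b) - x) ** 2 for x in xs) / n


if __name__ == "__main__":
    w, b = train()
    print("actor parameters:", w, b)
    print("error before training:", reconstruction_error(0.0, 0.0))
    print("error after training:", reconstruction_error(w, b))
```

In this toy setting the exact inverse is a = 0.5x - 0.25, so the learned `w` and `b` can be compared against those values. Once trained, the actor maps a new input to parameters in a single evaluation, in contrast to running an iterative optimization for every segment.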