PERCEPTUAL-SIMILARITY-AWARE DEEP SPEAKER REPRESENTATION LEARNING FOR MULTI-SPEAKER GENERATIVE MODELING
Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari
We propose novel algorithms for incorporating perceptual similarity among speakers into deep speaker representation learning. The proposed algorithms use a perceptual speaker similarity matrix obtained from large-scale perceptual scoring as the target for training a speaker encoder. They learn speaker embeddings from three different representations of the matrix: a set of similarity vectors, the Gram matrix, and a similarity graph. To reduce the costs of scoring and training, we further propose an active learning algorithm that alternates between perceptual similarity scoring and speaker encoder training, selecting the speaker pairs to be scored next on the basis of the similarity predictions of the sequentially trained speaker encoder. The evaluation results demonstrate that 1) our representation learning algorithms learn speaker embeddings strongly correlated with perceptual similarity scores, 2) the embeddings improve synthetic speech quality in speech autoencoding tasks more than conventional d-vectors obtained by discriminative modeling, 3) our active learning algorithm achieves higher synthetic speech quality while reducing the scoring and training costs, and 4) among the proposed similarity vector, matrix, and graph embedding algorithms, the vector-based one achieves the best speaker similarity of synthetic speech, and the graph-based one gives the largest improvement in synthetic speech naturalness.
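To make the Gram-matrix variant concrete, the sketch below shows one plausible way to fit embeddings to a perceptual similarity matrix. It is not the authors' implementation: the matrix S, the free per-speaker embedding table, and all hyperparameters are illustrative placeholders introduced here for exposition.

```python
# Minimal sketch of a Gram-matrix-style similarity embedding (illustrative only).
import torch
import torch.nn.functional as F

N, dim = 16, 64                        # hypothetical: 16 speakers, 64-dim embeddings

# Placeholder for the perceptual speaker similarity matrix; real values would
# come from large-scale perceptual scoring, scaled to the cosine range [-1, 1].
S = torch.rand(N, N)
S = (S + S.T) / 2                      # perceptual similarity scores are symmetric
S.fill_diagonal_(1.0)                  # a speaker is maximally similar to itself

emb = torch.nn.Parameter(0.01 * torch.randn(N, dim))   # one embedding per speaker
opt = torch.optim.Adam([emb], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    e = F.normalize(emb, dim=1)        # unit-norm embeddings
    gram = e @ e.T                     # cosine-similarity Gram matrix of embeddings
    loss = F.mse_loss(gram, S)         # pull the Gram matrix toward perceptual scores
    loss.backward()
    opt.step()
```

In the setting described above, the embeddings would be produced by a speaker encoder from speech features rather than a free embedding table, and the similarity vector and graph variants would use their own targets in place of the Gram-matrix loss.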