Distortion-Controlled Training For End-To-End Reverberant Speech Separation With Auxiliary Autoencoding Loss
Yi Luo, Cong Han, Nima Mesgarani
The performance of speech enhancement and separation systems in anechoic environments has advanced significantly with recent progress in end-to-end neural network architectures. However, the performance of such systems in reverberant environments remains largely unexplored. A core problem in reverberant speech separation concerns the training and evaluation metrics: standard time-domain metrics may introduce unexpected distortions during training and fail to properly evaluate separation performance in the presence of reverberation. In this paper, we first introduce the ``equal-valued contour'' problem in reverberant separation, where multiple outputs can lead to the same performance as measured by common metrics. We then investigate how ``better'' outputs with lower target-specific distortions can be selected via auxiliary autoencoding training (A2T). A2T assumes that the separation is performed by a linear operation on the mixture signal, and it adds a loss term on the autoencoding of the direct-path target signals to ensure that the distortion introduced on the direct-path signals is controlled during separation. Evaluations of separation signal quality and speech recognition accuracy show that A2T is able to control the distortion on the direct-path signals and improve recognition accuracy.
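To make the objective concrete, below is a minimal sketch of an A2T-style loss in PyTorch. It assumes the separator is exposed as a callable implementing a linear operation (e.g., masking in a learned filterbank domain) that maps a mixture waveform to per-source estimates, and it uses a negative-SNR base loss with a weighting factor `alpha`; the `separator` interface, the SNR choice, and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def neg_snr(est, ref, eps=1e-8):
    # Negative signal-to-noise ratio in dB; minimizing this maximizes SNR.
    noise = est - ref
    snr = 10 * torch.log10(
        ref.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps
    )
    return -snr.mean()


def a2t_loss(separator, mixture, direct_targets, alpha=1.0):
    """A2T objective: separation loss + auxiliary autoencoding loss.

    separator: callable mapping a [batch, time] waveform to
               [batch, n_src, time] outputs via a linear operation
               (hypothetical stand-in for the actual separation model).
    mixture: [batch, time] reverberant mixture waveform.
    direct_targets: [batch, n_src, time] direct-path source signals.
    """
    # 1) Standard separation term: estimate sources from the mixture
    #    and compare against the direct-path targets.
    estimates = separator(mixture)
    sep_term = neg_snr(estimates, direct_targets)

    # 2) Auxiliary autoencoding term: pass each direct-path target
    #    through the same linear separator and require output i to
    #    reconstruct input i, so the separator cannot introduce
    #    distortion on the direct-path component.
    n_src = direct_targets.shape[1]
    ae_term = 0.0
    for i in range(n_src):
        recon = separator(direct_targets[:, i])[:, i]
        ae_term = ae_term + neg_snr(recon, direct_targets[:, i])
    ae_term = ae_term / n_src

    return sep_term + alpha * ae_term
```

Because the separator is constrained to be linear, the autoencoding term directly measures the distortion the separation operation applies to a clean direct-path signal, which is the quantity A2T is designed to control.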