SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 01:02:57
Many classical speech processing problems, such as enhancement, source separation, dereverberation, and bandwidth expansion, can be formulated as finding mapping functions that transform input spectra into output spectra. Leveraging machine learning and big data paradigms, we cast these spectral mapping problems as learnable deep regression. According to Kolmogorov's Representation Theorem (1957), a continuous multivariate scalar function can be expressed exactly as a superposition of a finite number of outer functions, each applied to a linear combination of inner functions. Cybenko (1989) developed a universal approximation theorem showing that such a scalar function can be approximated by a superposition of sigmoid functions, inspiring a new wave of neural network algorithms. Barron (1993) later proved that the approximation error can be tightly bounded and related to the representation power in learning theory. In this talk, we first present four new theorems that generalize the universal approximation theorems from sigmoid networks to deep neural networks (DNNs) and from vector-to-scalar to vector-to-vector regression. We also show that the generalization loss, or regression error, in machine learning theory can be decomposed into three terms, namely approximation, estimation, and optimization errors, such that each term can be tightly bounded separately.

In practice, the theorems we developed provide guidelines for architecture selection in DNN design. In a series of experiments on high-dimensional nonlinear regression, we validate our theory in terms of representation and generalization power and demonstrate that, under adverse acoustic conditions, deep regression achieves good speech quality and clear intelligibility for microphone-array-based speech enhancement, separation, and dereverberation. Our proposed deep regression framework was also tested on many recent challenging tasks, including CHiME-2, CHiME-4, CHiME-5, CHiME-6, REVERB, and DIHARD III. Our teams scored the lowest error rates in almost all of these open evaluation scenarios. Finally, we believe a theoretical understanding of deep classification will be needed in order to advance automatic speech recognition and understanding (ASRU) technologies to the next level of performance and robustness.
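
For reference, the classical results cited in the abstract can be written out as below. This is a sketch in standard textbook notation, not the notation of the talk. The symbols $\Phi_q$, $\phi_{q,p}$, $\alpha_j$, $w_j$, $\theta_j$, $L$, $f^{*}$, $f_{\mathcal{F}}$, $f_{n}$ and $\hat{f}$ are assumptions introduced here for illustration, and the last line shows only one common way of splitting the excess risk into three terms; the talk's own decomposition and bounds may be stated differently.

```latex
% Kolmogorov (1957): exact representation of a continuous f on [0,1]^n
% by 2n+1 outer functions \Phi_q and univariate inner functions \phi_{q,p}.
f(x_1,\dots,x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\Big( \sum_{p=1}^{n} \phi_{q,p}(x_p) \Big)

% Cybenko (1989): finite sums of sigmoids \sigma of this form are dense in
% the continuous functions on [0,1]^n (one hidden layer, N units).
G(x) \;=\; \sum_{j=1}^{N} \alpha_j \,\sigma\!\big( w_j^{\top} x + \theta_j \big)

% One common three-term split of the excess risk (assumed notation):
%   f^{*}            : target (best possible) regression function
%   f_{\mathcal{F}}  : best function in the chosen network class \mathcal{F}
%   f_{n}            : empirical risk minimizer in \mathcal{F} on n samples
%   \hat{f}          : function actually returned by the optimizer
L(\hat{f}) - L(f^{*})
  \;=\; \underbrace{L(f_{\mathcal{F}}) - L(f^{*})}_{\text{approximation}}
  \;+\; \underbrace{L(f_{n}) - L(f_{\mathcal{F}})}_{\text{estimation}}
  \;+\; \underbrace{L(\hat{f}) - L(f_{n})}_{\text{optimization}}
```

Read this way, the approximation term depends on the network architecture, the estimation term on the amount of training data, and the optimization term on how well the training algorithm minimizes the empirical loss.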