Tackling Real Noisy Reverberant Meetings With All-Neural Source Separation, Counting, And Diarization System
Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani
SPS
Length: 15:24
Automatic spoken conversation analysis is a fundamental technology required for smart devices to follow and respond to our conversations. To achieve optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves the source separation, speaker diarization, and source counting problems in an optimal way, in the sense that all three tasks can be optimized jointly through error back-propagation. The method was shown to handle simulated clean (noiseless and anechoic) dialog-like data well, achieving very good performance in comparison with several conventional methods. However, it was not clear whether such an all-neural approach could generalize to more complicated real meeting data, which contain more spontaneously speaking speakers, severe noise, and reverberation, nor how it would compare with state-of-the-art systems in such scenarios. In this paper, we first consider practical issues required to improve the robustness of the all-neural approach, and then show experimentally that, even in real meeting scenarios, it can perform effective speech enhancement while outperforming state-of-the-art systems.
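The central idea — training several tasks through one back-propagated objective rather than tuning each component separately — can be illustrated with a deliberately tiny sketch. This is a hypothetical toy, not the authors' model: a scalar "shared encoder" parameter feeds two scalar task heads (standing in for separation and diarization), and a weighted sum of the per-task errors is minimized by gradient descent, so both tasks' errors flow into the shared parameter. The function names, the loss weighting `alpha`, and the learning rate are all illustrative assumptions.

```python
# Toy sketch (hypothetical, not the paper's architecture): one shared
# parameter feeds two task heads; a single combined loss trains everything.

def forward(w_shared, w_sep, w_diar, x):
    h = w_shared * x               # shared "encoder" (toy scalar model)
    return w_sep * h, w_diar * h   # "separation" head, "diarization" head

def joint_loss(params, x, t_sep, t_diar, alpha=0.5):
    y_sep, y_diar = forward(*params, x)
    # Weighted sum of per-task squared errors = the joint objective.
    return alpha * (y_sep - t_sep) ** 2 + (1 - alpha) * (y_diar - t_diar) ** 2

def grad(params, x, t_sep, t_diar, eps=1e-6):
    # Numerical gradient of the joint loss w.r.t. every parameter,
    # standing in for error back-propagation in a real network.
    g = []
    for i in range(len(params)):
        p_hi = list(params); p_hi[i] += eps
        p_lo = list(params); p_lo[i] -= eps
        g.append((joint_loss(p_hi, x, t_sep, t_diar)
                  - joint_loss(p_lo, x, t_sep, t_diar)) / (2 * eps))
    return g

params = [0.5, 0.5, 0.5]           # w_shared, w_sep, w_diar
for _ in range(500):               # gradient descent on the joint loss
    g = grad(params, x=1.0, t_sep=2.0, t_diar=3.0)
    params = [p - 0.05 * gi for p, gi in zip(params, g)]
```

Because both task errors pass through `w_shared`, improving one head can reshape the shared representation used by the other — the property the abstract refers to as joint optimization of all three tasks.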