Advances on Multimodal Machine Learning Solutions for Speech Processing Tasks and Emotion Recognition
Dr. Fei Tao, Dr. Carlos Busso
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 12:12
Recent advances in multimodal processing have led to promising solutions for speech-processing tasks. One example is automatic speech recognition (ASR), a key component of current speech-based systems. Since surrounding acoustic noise can severely degrade the performance of an ASR system, an appealing solution is to augment conventional audio-based ASR systems with visual features describing lip activity.

We describe a novel end-to-end, multitask learning (MTL) audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR and the secondary task is audiovisual voice activity detection (AV-VAD). The result is a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, the AV-ASR performance improves because its connectionist temporal classification (CTC) loss function can leverage the AV-VAD alignment information (a toy sketch of this multitask setup follows the abstract). Furthermore, the end-to-end system learns a discriminative high-level representation for both speech tasks from the raw audiovisual inputs, providing the flexibility to mine information directly from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. In addition to state-of-the-art AV-ASR performance, the proposed solution also provides valuable information about speech activity, solving two of the most important tasks in speech-based applications.

This webinar will also discuss advances in multimodal solutions for emotion recognition. We describe multimodal pretext tasks that are carefully designed to learn better representations for predicting emotional cues from speech, leveraging the relationship between acoustic and facial features (a second sketch below illustrates one such cross-modal pretext task). We also discuss our current effort to design multimodal emotion recognition strategies that effectively combine auxiliary networks, a transformer architecture, and an optimized training mechanism for aligning modalities, capturing temporal information, and handling missing features (see the third sketch below). These models offer principled solutions to increase the generalization and robustness of emotion recognition systems.
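To make the multitask idea concrete, the following is a minimal PyTorch sketch of the setup described above: a shared audiovisual encoder feeds two heads, a CTC head for the primary AV-ASR task and a frame-level head for the secondary AV-VAD task, trained with a weighted sum of the two losses. The module names, feature sizes, and the loss weight alpha are illustrative assumptions, not the presenters' exact architecture.

import torch
import torch.nn as nn

class AVMultitaskModel(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, hidden=256, vocab=29):
        super().__init__()
        # Shared encoder over concatenated audio and visual features.
        self.encoder = nn.LSTM(audio_dim + video_dim, hidden,
                               num_layers=2, batch_first=True,
                               bidirectional=True)
        self.asr_head = nn.Linear(2 * hidden, vocab)  # per-frame CTC logits
        self.vad_head = nn.Linear(2 * hidden, 1)      # per-frame speech/non-speech

    def forward(self, audio, video):
        x = torch.cat([audio, video], dim=-1)         # (B, T, audio_dim + video_dim)
        h, _ = self.encoder(x)
        return self.asr_head(h), self.vad_head(h).squeeze(-1)

model = AVMultitaskModel()
ctc_loss = nn.CTCLoss(blank=0)
vad_loss = nn.BCEWithLogitsLoss()
alpha = 0.3  # weight of the secondary AV-VAD task; an assumed value

audio = torch.randn(4, 100, 40)                   # toy batch: 4 utterances, 100 frames
video = torch.randn(4, 100, 512)
targets = torch.randint(1, 29, (4, 20))           # toy label sequences (0 is blank)
in_lens = torch.full((4,), 100, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)
vad_labels = torch.ones(4, 100)                   # frame-level speech-activity labels

asr_logits, vad_logits = model(audio, video)
log_probs = asr_logits.log_softmax(-1).transpose(0, 1)  # (T, B, vocab) for CTC
loss = ctc_loss(log_probs, targets, in_lens, tgt_lens) \
       + alpha * vad_loss(vad_logits, vad_labels)
loss.backward()

The single backward pass over the combined loss is what lets the AV-VAD alignment information shape the shared representation used by the CTC objective.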
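The second sketch illustrates one possible cross-modal pretext task in the spirit of the emotion-recognition description: learning a speech representation by predicting facial features from acoustic features, so that the encoder captures audiovisual structure relevant to emotional cues. The regressor, feature sizes, and L1 objective are illustrative assumptions, not the exact pretext tasks presented in the webinar.

import torch
import torch.nn as nn

# Encoder whose output representation is reused for emotion recognition.
acoustic_encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU(),
                                 nn.Linear(256, 128))
face_regressor = nn.Linear(128, 68 * 2)   # e.g., 68 two-dimensional facial landmarks

acoustic = torch.randn(32, 40)            # toy frame-level acoustic features
landmarks = torch.randn(32, 68 * 2)       # paired facial landmarks (same frames)

z = acoustic_encoder(acoustic)            # representation learned by the pretext task
pretext_loss = nn.functional.l1_loss(face_regressor(z), landmarks)
pretext_loss.backward()
# After pre-training, acoustic_encoder can be fine-tuned on labeled emotion
# data instead of being trained from scratch.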
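The third sketch shows one way a transformer can fuse modalities while tolerating missing features: per-modality projections, a learned placeholder token substituted when the visual stream is absent, and a transformer encoder attending over the joint sequence. The class name, hyperparameters, and mask-token strategy are assumptions for illustration, not the exact strategy discussed in the webinar.

import torch
import torch.nn as nn

class AVEmotionTransformer(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, d_model=128, n_emotions=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Learned placeholder used when the visual stream is missing.
        self.video_mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, audio, video=None):
        a = self.audio_proj(audio)                        # (B, T, d_model)
        if video is None:                                 # missing-modality case
            v = self.video_mask_token.expand(a.size(0), a.size(1), -1)
        else:
            v = self.video_proj(video)
        h = self.fusion(torch.cat([a, v], dim=1))         # joint cross-modal attention
        return self.classifier(h.mean(dim=1))             # utterance-level prediction

model = AVEmotionTransformer()
audio = torch.randn(2, 50, 40)
video = torch.randn(2, 50, 512)
print(model(audio, video).shape)   # torch.Size([2, 4])
print(model(audio).shape)          # still works when the video stream is missing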