  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:09:50
08 May 2022

Audio-visual automatic speech recognition (AV-ASR) can improve recognition accuracy by using lip images, especially in noisy environments. The recently proposed AV Align system integrates speech and image features via a cross-modal attention mechanism, in which attention weights over the visual features are computed using the acoustic features as queries. Although AV Align improves recognition accuracy under background noise, we have observed that its accuracy degrades significantly in interfering-speaker environments, where the target and interfering speech overlap. To improve target-speaker recognition accuracy in such situations, we propose training the AV-ASR model with a combination of the CTC loss function and an auxiliary loss function that maximizes the recognition accuracy of the interfering speaker. Experimental results on the TCD-TIMIT dataset show that these auxiliary loss functions improve target-speaker speech recognition in interfering-speaker environments.
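The two ingredients described in the abstract — cross-modal attention with acoustic queries over visual features, and a weighted combination of a target-speaker CTC loss with an auxiliary interfering-speaker loss — can be sketched roughly as follows. This is an illustrative approximation only: the function names, the plain scaled dot-product attention, and the interpolation weight `alpha` are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def cross_modal_attention(acoustic, visual):
    """Scaled dot-product attention where acoustic frames act as queries
    over visual (lip-image) features, in the spirit of AV Align's fusion.

    acoustic: (T_a, d) acoustic feature sequence (queries)
    visual:   (T_v, d) visual feature sequence (keys and values)
    Returns:  (T_a, d) visual context vector aligned to each acoustic frame.
    """
    d = acoustic.shape[-1]
    scores = acoustic @ visual.T / np.sqrt(d)       # (T_a, T_v) similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visual frames
    return weights @ visual                         # attention-weighted context

def combined_loss(ctc_target, aux_interfering, alpha=0.3):
    """Hypothetical interpolation of the target-speaker CTC loss with the
    auxiliary loss on the interfering speaker; `alpha` is an assumed weight,
    not a value reported in the paper."""
    return (1.0 - alpha) * ctc_target + alpha * aux_interfering
```

In a real model the attention weights would be learned jointly with the encoders, and the CTC losses computed with a standard toolkit implementation; the sketch above only shows how the acoustic stream selects which visual frames contribute at each time step.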
