  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:08:49
10 Jun 2021

Automatic pronunciation assessment plays an important role in Computer-Assisted Pronunciation Training (CAPT). Traditional methods for assessing pronunciation in read-aloud tasks rely on features derived from automatic speech recognition (ASR), so they are sensitive both to ASR accuracy and to the effectiveness of those features. Moreover, the representation capability of the features is limited by the mismatch between the optimization goals of the ASR task and the scoring task. In this paper, we propose an end-to-end (E2E) pronunciation scoring network based on an attention mechanism and a multi-encoder consisting of audio and text encoders. The network, optimized within a multi-task learning (MTL) framework, provides scoring at the sentence level as well as detailed scoring at the word level. Because data for pronunciation scoring is scarce, we use ASR data and synthetic data to pre-train the network in two steps, and then fine-tune it on the limited high-quality scoring data. Experimental results on a dataset recorded by Chinese English-as-a-second-language (ESL) learners and labeled by three experts demonstrate that the proposed model outperforms the baseline in Pearson correlation coefficient (PCC).
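
As a rough illustration of the evaluation metric and the multi-task objective mentioned in the abstract, the following is a minimal Python sketch. It is not the authors' implementation. The function names `pearson_corr` and `multitask_loss`, the parameter `word_weight`, and the weighted-sum form of the combined loss are all assumptions made for illustration; the abstract does not specify how the sentence-level and word-level objectives are combined.

```python
import math


def pearson_corr(predicted, reference):
    """Pearson correlation coefficient (PCC) between two score lists.

    `predicted` would hold model scores and `reference` the expert scores
    (hypothetical names; the abstract only states that PCC is the metric).
    """
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Unnormalized covariance and variances; the 1/n factors cancel out.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when one list is constant")
    return cov / math.sqrt(var_p * var_r)


def multitask_loss(sentence_loss, word_loss, word_weight=0.5):
    """Combine sentence-level and word-level losses as a weighted sum.

    ASSUMPTION: a weighted sum is one common way to set up multi-task
    learning; the weighting actually used in the paper is not given here.
    """
    if not 0.0 <= word_weight <= 1.0:
        raise ValueError("word_weight must be in [0, 1]")
    return (1.0 - word_weight) * sentence_loss + word_weight * word_loss
```

A PCC close to 1 would mean the model's scores rise and fall with the expert scores, which is the sense in which the proposed model is compared against the baseline.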

Chairs:
Eric Fosler-Lussier
